Summary
Stripe is a company that relies on data to power their products and business. To support that functionality they have invested in Trino and Iceberg for their analytical workloads. In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'm interviewing Kevin Liu about his use of Trino and Iceberg for Stripe's data lakehouse
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what role Trino and Iceberg play in Stripe's data architecture?
- What are the ways in which your job responsibilities intersect with Stripe's lakehouse infrastructure?
- What were the requirements and selection criteria that led to the selection of that combination of technologies?
- What are the other systems that feed into and rely on the Trino/Iceberg service?
- What kinds of questions are you answering with table metadata?
- What use cases and teams does that support?
- What is the comparative utility of the Iceberg REST catalog?
- What are the shortcomings of Trino and Iceberg?
- What are the most interesting, innovative, or unexpected ways that you have seen Iceberg/Trino used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Stripe's data infrastructure?
- When is a lakehouse on Trino/Iceberg the wrong choice?
- What do you have planned for the future of Trino and Iceberg at Stripe?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- Trino
- Iceberg
- Stripe
- Spark
- Redshift
- Hive Metastore
- Python Iceberg
- Python Iceberg REST Catalog
- Trino Metadata Table
- Flink
- Tabular
- Delta Table
- Databricks Unity Catalog
- Starburst
- AWS Athena
- Kevin's Trino Fest Presentation
- Alluxio
- Parquet
- Hudi
- Trino Project Tardigrade
- Trino On Ice
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. Go to [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for. Starburst has complete support for all table formats, including Apache Iceberg, Hive, and Delta Lake. And Starburst is trusted by teams of all sizes, including Comcast and DoorDash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst today and get $500 in credits to try Starburst Galaxy, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey, and today I'm interviewing Kevin Liu about his use of Trino and Iceberg for Stripe's data lakehouse. So, Kevin, can you start by introducing yourself?
[00:01:01] Kevin Liu:
Of course. Hey, everyone. My name is Kevin Liu. I'm currently a software engineer at Stripe. For the past three years, I've been working in the big data infrastructure ecosystem, primarily with Trino and Iceberg. Recently, I've taken on a new challenge working in data sharing, which is really awesome. I'm here to talk about use cases for Trino and Iceberg.
[00:01:25] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:28] Kevin Liu:
I just stumbled upon it, honestly. I started a new job at Stripe, was put onto a team working with Trino and Iceberg, and had no idea what they were. From there it was a lot of learning, a lot of contributing to open source, working with the community, and getting to know the technology.
[00:01:50] Tobias Macey:
And in that context of Trino and Iceberg, the ways that it's being applied at Stripe, I'm wondering if you can just start by giving a bit of an overview about how it's being used, its overall position and responsibilities within the broader data architecture of Stripe, and some of the initial experiences that you had getting onboarded into that ecosystem?
[00:02:13] Kevin Liu:
Mhmm. Yeah. A lot of the context is from before my time, so I only know it from reading and from historical context, but I can give you an overview of what it looks like today. Trino and Iceberg are what we use for the majority of business analytics, data analytics, dashboarding, anything to do with reading big datasets. A lot of transformation and a lot of dashboarding and presentation is done at the Trino layer. And there's a clear distinction between our use of Trino and our use of something like Spark. Spark is used for transformation, for writing and transforming big datasets, versus our current use case for Trino, which is reading. That's the big distinction.

Historically, we actually moved away from Redshift to Trino. We had reached a scale where we needed to outgrow Redshift, and Trino's distributed nature, the fact that you can scale up and scale out Trino clusters and have them work on your big datasets, really helped us scale the organization and our needs for business intelligence and data analytics.

So that's the context behind it. Currently, everyone who is using any kind of data and doing any kind of analytics at Stripe uses Trino to power that on the back end.
[00:03:57] Tobias Macey:
Given the very read heavy nature, and I know that there has been a lot of work in recent years put into being able to optimize for things like query caching, read speeds, etcetera. I know that Iceberg helps in that, but I'm curious what are some of the edge cases that you've run into and that you've seen other people run into in that very analytics heavy, read heavy environment of being able to query across these large datasets. And maybe if you're able to give some sense of what large means in your context and just some of the ways that you've started to hit against some of those limitations in your experience.
[00:04:36] Kevin Liu:
Yeah. Let me start with how big the data is and what bottlenecks we were facing at Stripe. The entire Stripe ecosystem of data is massive, petabytes of data or more. And doing analytics on that, joining data, making sure we can do reporting and operational analytics, a lot of that is powered by Trino, by many clusters running many, many machines.

One of the bottlenecks we faced was actually the concurrency of running queries. It's really popular for people at Stripe to pop into an internal website, write some SQL query against our big data store, and find some result from that. It has gotten to the point where it's almost a central repository: if you want data to be easily accessible, you dump it into this ecosystem so everyone else can go to one centralized website and query for it. From what I've read, that concurrency was the bottleneck for Redshift. When you have thousands of people all trying to query at once, that's something Redshift wasn't good at at the time.

Because we can scale Trino out, with multiple clusters and multiple machines backing that compute, and because Trino is made for this kind of fast, ad hoc analytics, it worked well in this realm. Another bottleneck we faced was the whole Hive versus Iceberg question. As with any organization that started five, eight, ten years ago, Hive was the main aspect of the data platform, especially in big data. Everyone writes to Hive. Spark writes to Hive. Trino used to have the Hive connector as its initial connector.

But eventually, everyone moved from Hive to Iceberg, and there's a clear reason for that. One of them, and people repeat this over and over again, is how Hive handles partitioning and the fact that you have to list files in a blob store like S3. Hive was created in a different paradigm. It was made for Hadoop, which has a more efficient list operation, and with Hive on S3 and blob stores, listing becomes very expensive. A very real concern when you're working with big data is partitioning and having to list all of the partitions.

When you do multiple levels of partitioning, say I want to partition by a timestamp plus some kind of clustering ID, it just blows up the amount of listing you have to do. First of all, listing in S3 is really expensive; S3 list calls really add up in terms of cost. Secondly, it's just not efficient. If you have a thousand files and you have to list them every time, it becomes really tedious, memory intensive, and slow.

And when you're using the Hive ecosystem, a lot of the time you have the Hive Metastore as the metadata layer, and the Hive Metastore, when you're listing thousands of files, can really just blow up on you, which is not fun as a big data engineer. A lot of the reasons we picked and ended up with Trino and Iceberg have to do with these underlying bottlenecks, and these technologies really helped us solve for them.
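To make the listing blow-up concrete, here is a back-of-the-envelope sketch. All the partition counts are made up for illustration; the point is that with Hive-style directory layouts, each partition level multiplies the number of object-store LIST calls a planner has to make.

```python
# Hypothetical partition scheme, to illustrate how nested Hive-style
# partitioning multiplies object-store LIST calls. Counts are invented.
days = 365          # one year of timestamp partitions (assumed)
cluster_ids = 200   # second partition level (assumed)

# Hive layout: s3://bucket/table/ds=.../cluster_id=.../file.parquet
# Planning a full scan needs one LIST to discover the ds= directories,
# then one LIST per leaf directory to find the data files.
list_calls = days + days * cluster_ids
print(list_calls)  # -> 73365

# Iceberg instead records every data file in manifest files, so planning
# reads a handful of manifests regardless of partition fan-out.
```

The exact accounting varies by engine, but the multiplicative shape is why adding a second partition column to a Hive table can make S3 listing both slow and expensive.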
[00:09:08] Tobias Macey:
Digging a bit more into the specifics of your infrastructure, it sounds like you are using S3 or some analogous blob store as the storage layer, and obviously you're using Iceberg as the table format. I'm wondering if you are using the Hive Metastore as the means of storing the pointers to the appropriate metadata, or have you moved to the REST catalog functionality that Iceberg has been adding recently? And what are some of the ways that you think about that decision?
[00:09:39] Kevin Liu:
Yeah. As a very general overview, as I already alluded to: Spark for writing and transformation, Trino for reading the data. On the data file side, we use a blob store, S3, to store the actual data. But on the table format side, we're actually kind of stuck in between Hive land and Iceberg land. We use both, and we're still slowly migrating from Hive to Iceberg, which means that for both Spark and Trino we have to support both, because some tables are in Hive format and some tables are in Iceberg format.

There's a slow transition to move everything into Iceberg. Both Hive and Iceberg require a metadata catalog, and we started with the Hive Metastore; that's the thing Hive requires you to run. And actually, the initial version of Iceberg can run off the Hive Metastore, because when it was first implemented, it was a direct replacement for Hive. So we're still using the Hive Metastore, but there's a slow migration: how do we get out of the Hive Metastore? How do we get out of the Hive table format? And as you alluded to, I think the answer is the REST catalog.

I actually wrote some articles on this, basically saying that even if you currently have the Hive Metastore, you can add a REST catalog component on top of it, so there's a level of indirection between your compute and your catalog. As of now, Trino, Spark, and a lot of other engines all support the REST catalog. From a big data engineer's perspective, the REST catalog really helps decouple the engine from the underlying catalog and table format, and it allows us to do the migration transparently, without our users knowing what's going on in the background. It gives us a tool to do all of these migrations, improvements, and optimizations.
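The indirection Kevin describes can be sketched in a few lines: engines never hardcode where a table's metadata lives, they ask the catalog. This is a toy in-memory stand-in, not the real Iceberg REST API or Stripe's implementation; all names are invented.

```python
# Toy illustration of catalog indirection: engines resolve a table name
# to an Iceberg metadata location through the catalog, so the backing
# store (Hive Metastore today, something else tomorrow) can be swapped
# without the engines noticing. Everything here is made up.

class ToyCatalog:
    """Maps table identifiers to Iceberg metadata file locations."""

    def __init__(self):
        self._tables = {}

    def register_table(self, identifier, metadata_location):
        self._tables[identifier] = metadata_location

    def load_table(self, identifier):
        # Any engine (Trino, Spark, Flink) calls this before a read.
        return self._tables[identifier]


catalog = ToyCatalog()
catalog.register_table(
    "analytics.payments", "s3://bucket/payments/metadata/v42.json"
)

# Migrating the table only means updating the catalog entry,
# not reconfiguring every engine that reads it.
location = catalog.load_table("analytics.payments")
print(location)
```

The real REST catalog adds namespaces, commits, and authentication on top, but the decoupling benefit comes from exactly this single point of resolution.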
[00:12:12] Tobias Macey:
And I was noticing recently that you are at least involved in, if not the author of, the Python Iceberg REST catalog. I'm wondering what some of your inspiration or motivation was to put in that work, and some of the ways that you're hoping to see it used?
[00:12:31] Kevin Liu:
Yeah. I'm really excited about the REST catalog, honestly, because from an industry standpoint, everyone has standardized on this format. You can go to Spark, you can go to Snowflake, you can go to any engine or vendor, and they have a plugin for the REST catalog. Which means there's a standard now where you can say, I have this REST catalog, and I can take it and go wherever. That really helps with avoiding vendor lock-in, making it very easy to transition from one platform to another. And it levels the playing field for what a vendor should provide: if it's easy for me to take my data and go somewhere else, my current vendor had better give me the best available technology and optimizations, because if someone else is better, I can just go there. Coming back to the Python Iceberg REST catalog implementation, that's been on my mind for quite a while, because for the majority of my time at Stripe I've been dealing with the Hive Metastore, making sure it doesn't blow up, looking at what it's doing in the ecosystem.

And I really think this REST catalog idea is a direct replacement, if not more, for the Hive Metastore. There are a lot of interesting things you can do once everyone is on this kind of technology and implementation.
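Part of why a REST catalog is easy to standardize on is how small the core surface is. This sketch mimics the shape of the spec's loadTable route (`GET /v1/{prefix}/namespaces/{ns}/tables/{table}`) with an in-memory dispatcher; the stored responses are simplified illustrations, not the full spec payloads.

```python
# Minimal sketch of the request/response shape a REST catalog answers.
# Route pattern follows the Iceberg REST spec's loadTable endpoint;
# storage and response bodies here are simplified stand-ins.
import re

TABLES = {
    ("analytics", "payments"): {
        "metadata-location": "s3://bucket/payments/metadata/v42.json",
        "metadata": {"format-version": 2},
    }
}

LOAD_TABLE = re.compile(r"^/v1/namespaces/(?P<ns>[^/]+)/tables/(?P<tbl>[^/]+)$")

def handle_get(path):
    """Dispatch a GET path to a catalog response (toy version)."""
    m = LOAD_TABLE.match(path)
    if m:
        key = (m.group("ns"), m.group("tbl"))
        if key in TABLES:
            return 200, TABLES[key]
        return 404, {"error": "NoSuchTableException"}
    return 404, {"error": "unknown route"}

status, body = handle_get("/v1/namespaces/analytics/tables/payments")
print(status, body["metadata-location"])
```

Because the contract is just HTTP plus a published OpenAPI spec, any language can implement it, which is what makes a Python implementation practical in the first place.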
[00:14:12] Tobias Macey:
Moving back into some of the work that you're doing at Stripe, the ways that you are taking advantage of the Trino and Iceberg combination. I'm wondering what are some of the projects that you have been most excited about or most interested by in your work at Stripe and some of the specifics of Iceberg, Trino, the combination thereof that have led you to being able to employ them beyond just the very simple, I have my data somewhere. I can query it.
[00:14:44] Kevin Liu:
Right. Yeah. I think the biggest thing that I'm really proud of is utilizing these two technologies and, how do you say it, syncing them in a way where it's not just: Trino queries Iceberg and gets the result. Trino itself is a really powerful engine. It allows you to read from whatever data source you can connect to. You can say, I have a Postgres database, I can connect that to Trino, and magically you can now write SQL against a Postgres database somewhere else. So the thing I'm really proud of at Stripe is using this ecosystem and plugging things all the way back into Trino. I'll give you a concrete example. I talked about the Hive Metastore.

I talked about Iceberg. Iceberg itself has a lot of metadata. Because it's a table format, it can store more than just the underlying data. It can give you the details on the partition scheme, on the schema, metadata about how big your table is, the distribution of your table, the min and max, all the stats. And all of this metadata in Iceberg land lives in metadata files. As a big data engineer trying to figure out what's going on with a specific Iceberg table, it was really difficult to dig in and do diagnostics on it. So one of the things I created at Stripe was to plug that metadata ecosystem back into Trino.

Then I can write SQL against it and diagnose my Iceberg table using Trino. There are two aspects to this. One is that Trino has this idea of metadata tables, where instead of reading the data itself, you can read the metadata of the table format. For things like partitioning, I can just select everything from the table name plus a dollar sign and "partitions", and now instead of reading the data, it gives me partition data back. It comes back in tabular form, so I can join it if I want, aggregate it, and perform any kind of analytics. That's one piece: you can plug the metadata back into Trino itself. The second piece is the catalog, the Hive Metastore. The Hive Metastore is essentially a catalog of all of the data and tables registered at Stripe, everything you're able to read at Stripe.
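In Trino, those metadata tables are queried like ordinary tables, for example `SELECT * FROM iceberg.analytics."payments$partitions"` (the quotes are needed because of the dollar sign). The sketch below uses invented rows shaped roughly like that query's output to show the kind of diagnostic you can run once partition metadata is just a table.

```python
# Toy stand-in for rows from a Trino "$partitions" metadata table, e.g.
#   SELECT partition, record_count, file_count
#   FROM iceberg.analytics."payments$partitions"
# The rows and threshold below are invented for illustration.

partitions = [
    {"partition": "ds=2024-05-01", "record_count": 1_200_000, "file_count": 40},
    {"partition": "ds=2024-05-02", "record_count": 900_000, "file_count": 310},
    {"partition": "ds=2024-05-03", "record_count": 1_100_000, "file_count": 35},
]

# A common diagnostic: partitions with many files but few records per
# file have a small-files problem and are compaction candidates.
compaction_candidates = [
    p["partition"]
    for p in partitions
    if p["record_count"] / p["file_count"] < 10_000
]
print(compaction_candidates)  # -> ['ds=2024-05-02']
```

In practice you would express the filter directly in SQL against the `$partitions` table; the Python here just makes the aggregation logic explicit.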
And the idea is: how do we analyze that catalog data so there's an easier way to figure out what's going on in our ecosystem? For example, the Hive Metastore itself is backed by an underlying database; for us, that's Postgres. I want to read that Postgres data and figure out, say, how many tables are there in total at Stripe? How many Hive tables? How many Iceberg tables? So what we did was plug that underlying database itself back into Trino. Now I can query it to ask: what is this table?

When was it last updated? That metadata is all in the database, and by plugging it directly into Trino, I can query it in real time, because underneath, Trino is issuing a query directly to that database. Mind you, you should set up a read-only database if you're doing that against your backing store. And the third piece is, we took these two ideas, looked at them, and said, okay, this is great, we have a pretty nice infrastructure component for plugging things back into Trino. What can we do with that?
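The metastore-database query described above can be sketched with an in-memory stand-in, sqlite here in place of Postgres. The real Hive Metastore schema (tables like `TBLS` and `TABLE_PARAMS`) is considerably more involved; this simplified single table keeps only what the example needs.

```python
# Toy version of "query the metastore's backing database with SQL".
# Schema and rows are simplified inventions, not the real HMS schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbls (name TEXT, table_format TEXT)")
conn.executemany(
    "INSERT INTO tbls VALUES (?, ?)",
    [("payments", "iceberg"), ("charges", "iceberg"), ("legacy_events", "hive")],
)

# The kind of question described: how far along is the Hive-to-Iceberg
# migration, by table count?
counts = dict(
    conn.execute("SELECT table_format, COUNT(*) FROM tbls GROUP BY table_format")
)
print(counts)  # -> {'hive': 1, 'iceberg': 2}
```

With the database connected through Trino's Postgres connector, the same `GROUP BY` would run as live SQL against the (read-only) metastore backing store.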
We're running this infrastructure essentially to provide SQL query capability at Stripe, and a big thing we were facing at the time was analytics on what is being run on Trino: how much CPU it's using, how much memory it's using, are we blowing up a cluster, are we optimizing the cluster? So the idea is to log each query that's run on Trino. Trino has a nice connector that gives us that information, along with a lot of usage information from Trino itself, like how much memory and CPU each query used.

So what we did is we took that information from the connector and dumped it into another database, a Postgres database. Every time a query is run, we write a row with the information about that query into our Postgres database. And then we plug that back into Trino. So now, on Trino, you can ask: in the last five minutes, how many queries were run? Something we're interested in specifically: was there a spike in how many queries were submitted in the last five minutes?

Now you have near-real-time analytics on everything that runs on your platform. That's really helpful from an infrastructure standpoint, because it gives us more observability, and it gives us a component in our platform that we can keep adding information to. For example, we open Trino up to services within Stripe as well, so now we can ask: what is the resource usage of this specific service, and what kind of queries is it running? It's a nice component that we can keep adding metadata to and use for analytics.
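The query-log loop above reduces to something very simple once each finished query is a row: "was there a spike in the last five minutes?" is a filter plus a count. The rows, timestamps, and baseline below are invented for illustration.

```python
# Toy version of near-real-time analytics over a Trino query log.
# One dict per completed query; values are made up.
from datetime import datetime, timedelta

now = datetime(2024, 6, 1, 12, 0, 0)

query_log = [
    {"user": "svc-reporting", "submitted": now - timedelta(minutes=1)},
    {"user": "svc-reporting", "submitted": now - timedelta(minutes=2)},
    {"user": "alice",         "submitted": now - timedelta(minutes=4)},
    {"user": "bob",           "submitted": now - timedelta(minutes=30)},
]

recent = [q for q in query_log if now - q["submitted"] <= timedelta(minutes=5)]
print(len(recent))  # -> 3

# A crude spike check against an assumed baseline rate:
baseline_per_5min = 2
spike = len(recent) > baseline_per_5min
print(spike)  # -> True
```

In the real setup this would be a SQL query over the Postgres log table, issued through Trino itself, which is what makes the observability loop self-hosting.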
[00:21:14] Tobias Macey:
And with that meta capability, the meta cognition, if you will, about your data platform being able to round trip all of this information through that system, I'm wondering what are the types of use cases that that supports and some of the types of questions that people are asking and answering about the data platform itself using those capabilities?
[00:21:38] Kevin Liu:
Yeah. A big one, and we always get asked this, is: a data producer now has a v2 or upgraded version of a table, and the main question is, hey, I want to deprecate the old one, but I don't want to break everything. Can I go figure out who's using this table, either to tell them there's a new one, or just to make sure that when I do deprecate it, it's not going to break some obscure workstream? For us, being able to analyze every query that was run on the platform means we can see exactly which tables were used. That's a capability Trino gives us: for this query, which tables were queried?

Once we expose that information, that use case becomes super easy to achieve. I can just select from the query database where the affected table is this thing. And with the metadata I talked about before that's included with each query, we can actually point back to where the query originated. It could be that this service is running this query against this table, or these people are querying it, or this dashboard is querying this table.

So now you have a pointer back to the original use case, and you can say, okay, let me email everyone to say, hey, we're deprecating this. That's a very common use case for us: table deprecation, table transition. And there are a few more on the infra side, like rate limiting, making sure people are using their fair share of CPU and memory, and making sure people's queries are optimized: please include a partition filter so you don't read ten years of data.

Those are the kinds of things where we can use historical query information to help us make decisions for current use cases.
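The deprecation workflow is essentially a scan over the query log's referenced-table lists. A toy version, with invented log rows and origin labels:

```python
# Toy version of "who still reads this table?" before deprecating it.
# Log rows and origin labels are invented for illustration.
query_log = [
    {"origin": "dashboard:revenue", "tables": ["analytics.payments_v1"]},
    {"origin": "svc-reporting",     "tables": ["analytics.payments_v2"]},
    {"origin": "user:alice",        "tables": ["analytics.payments_v1",
                                               "analytics.refunds"]},
]

def consumers_of(table, log):
    """Everyone who referenced `table`, so they can be notified."""
    return sorted({q["origin"] for q in log if table in q["tables"]})

print(consumers_of("analytics.payments_v1", query_log))
# -> ['dashboard:revenue', 'user:alice']
```

The same structure answers the infra-side questions too: group by origin instead of table and you get per-service resource attribution.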
[00:24:10] Tobias Macey:
Because you have Trino as the read path and Spark as the write path, I'm sure there are other systems that may be able to hook into those Iceberg tables. What are some of the ways that you account for the fact that maybe not every read path, not every point of access to a particular data location, goes through Trino when determining: can I safely deprecate this table? Can I delete it? Can I stop feeding data into it? Just some of the ways that you are managing the multi-tool nature of the ecosystem, both at Stripe and in data more generally.
[00:24:52] Kevin Liu:
Yeah. Everything becomes hard once you start adding new tools. Flink is a very real use case for streaming and real-time analytics, and adding Flink to the mix adds an extra level of complexity to this ecosystem. So back to your question: with Spark on the write path and Trino on the read path, how do you know you're covering all the bases for the usage of one particular table? This is where I'm really excited about the REST catalog and this Iceberg notion of having a catalog, because you can really push down everything I just described. I say push down because I imagine Trino on top and the table formats below it; you can push all of those features down to the catalog level. Things like logging usage.

The catalog technically knows exactly when a table is used, because every engine goes through it. For Trino to read a table, it has to go to the catalog and grab that information before it can find where the table is. So we're really seeing the catalog as the centralized place when you have multiple engines. You can implement an engine-specific feature when you don't have many engines; I can implement everything I described once in Trino and again in Spark. But when you add Flink or yet another engine, you have to redo it all over again.

So the idea is that if we can push all of these features lower in the stack, the whole ecosystem becomes engine agnostic. Then we can plug in another engine and still get the same feature set without any changes to our ecosystem. We're not there yet, but that's the idea. And you see catalog vendors with similar ideas. Tabular was really into this idea of security at the catalog layer, so you don't have to reimplement security in every single engine with customized code for each one, because that's a lot of engineering effort and there are nuances to it. If you can lock it down at the catalog layer, then you can bring that catalog to whatever engine and get the same behavior.
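The push-down idea can be sketched as a decorator around the catalog: wrap it once, and every engine's table loads get recorded in one place. The classes and names here are invented; a real catalog would also log commits, authentication, and more.

```python
# Sketch of pushing usage logging down to the catalog layer: wrap the
# catalog so every load_table call is recorded, no matter which engine
# made it. All names here are invented for illustration.

class ToyCatalog:
    def __init__(self, tables):
        self._tables = tables

    def load_table(self, identifier):
        return self._tables[identifier]


class LoggingCatalog:
    """Decorates any catalog; engines talk to this instead."""

    def __init__(self, inner):
        self._inner = inner
        self.access_log = []

    def load_table(self, identifier):
        self.access_log.append(identifier)  # one place, every engine
        return self._inner.load_table(identifier)


catalog = LoggingCatalog(
    ToyCatalog({"analytics.payments": "s3://bucket/payments/metadata/v42.json"})
)
# Trino, Spark, and Flink would all resolve through the same wrapper:
catalog.load_table("analytics.payments")
catalog.load_table("analytics.payments")
print(catalog.access_log)  # -> ['analytics.payments', 'analytics.payments']
```

The same wrapping point is where catalog-level security or data-regionality checks would live, which is why centralizing on one catalog interface matters so much.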
[00:27:35] Tobias Macey:
In that future state, the catalog knows all of this information about the ways the data is being accessed. Coming back to the work you were doing on your REST catalog implementation, I imagine a lot of that information can be captured at the interface through which the catalog is queried. And I think that by moving to the REST API and away from the Hive Metastore, it becomes easier to put that logic into the REST API layer, maybe not uniformly across every implementation, but easier, so you can capture information about the access patterns: how often are they happening, and what are the sources these accesses are coming from?
[00:28:31] Kevin Liu:
Yeah. A lot of the feature sets and ideas for this catalog layer come from the Hive Metastore. The Hive Metastore contains a lot of features, and that's both good and bad. A multitude of features were implemented on the Hive Metastore because it was the centralized place that everyone runs, the meeting spot that everyone comes to. It has some bottlenecks in terms of performance, but the idea still holds true. With the Hive Metastore, we were writing customized code for security, for authorization, for, what's the word, data regionality.

Data has to live in a specific region, and we enforce that at the Hive Metastore layer. It's the same idea: now you just replace the Hive Metastore with the REST catalog. You have to be within the Iceberg ecosystem, but the REST catalog is in a similar vein, performing the same duty. It's the centralized place that every engine has to come to in order to do any kind of work. And in the Delta table world, Unity Catalog is doing the exact same thing. Unity Catalog is the piece of code you have to reach out to when you do any kind of write in the Delta world.

So I see the REST catalog as the basis. It's a spec with an implementation, and on top of that you can add a bunch of features to improve on the idea.
[00:30:23] Tobias Macey:
So we've talked a lot about the ways that Trino and Iceberg are useful, both in isolation and in combination. What are some of the aspects of those technologies that are still a pain point, and some of the ways you hope to see them evolve to improve the overall experience of working with them? Which features do you think should be pushed into one layer or the other, and which could be, but definitely shouldn't, because then you'd end up with a bloated monstrosity that does too many things?
[00:31:01] Kevin Liu:
Right. First of all, managing Trino is really hard, especially for a big organization with growing needs and growing bottlenecks. Managing the infrastructure itself becomes tedious, almost an operational job. That's one thing: we run open source Trino ourselves, so a lot of the operational burden is on the team. And because of how widely used it is and how critical it is to the daily operation of the company, it becomes critical for us to run this very well, with minimal downtime and very strict SLAs.
So that's the hard part. Maybe we'll explore using a vendor solution, something that will alleviate the maintenance burden for us. Starburst is one of them, Athena is another; there are other players here where we can say, okay, we'll handle the tooling and the ecosystem around it, but not running the underlying infrastructure. For us, it's just a black box: we run some compute nodes, we send SQL in, and results come out. So if we can delegate that somewhere else, I think it will alleviate a lot of the maintenance burden. The other thing, on Iceberg, is that managing Iceberg is still difficult because of this idea that you have to register the table in a catalog. There's metadata involved, and it's not as straightforward as just some files in a folder, the way Hive is. There's a level of abstraction in the metadata.
And when that breaks, it's a little bit harder to fix. This is where software engineering tooling comes in, to say, okay, if we want to restore a table to what it was before, we have to manage all this metadata, but we can use software to help us do that. Thinking back, another thing that has improved since I worked on Trino last year is the control plane for Trino. I know in the community there's a lot of movement for, I forget what it's called, but it's essentially a control plane where you're able to manage multiple Trino clusters simultaneously.
It handles routing for you, and it handles resource usage and everything. It's a part of the open source Trino community that I think is really helpful for any company managing more than one cluster. I'm really happy about that, and about seeing that open source contribution is still very active in that realm. I think we should explore something like that and use it in the future.
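[Editor's note: the project described here is likely the open-source Trino Gateway, a routing layer in front of multiple clusters. As a purely hypothetical sketch of the core idea (invented names and logic, not the Gateway's real code or API), a router only needs a rule for mapping an incoming query to a healthy, least-loaded cluster:]

```python
# Hypothetical sketch of a multi-cluster routing rule, in the spirit
# of a Trino control plane. All names here are invented for
# illustration; this is not the Trino Gateway's actual API.

from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    healthy: bool          # e.g. from a periodic health check
    running_queries: int   # e.g. from the cluster's stats endpoint

def route(clusters: list) -> Cluster:
    """Pick the healthy cluster with the fewest running queries."""
    candidates = [c for c in clusters if c.healthy]
    if not candidates:
        raise RuntimeError("no healthy Trino clusters available")
    return min(candidates, key=lambda c: c.running_queries)
```

A real control plane layers on query routing rules per user group, graceful cluster drain for upgrades, and resource accounting, but the load-aware selection above is the heart of it.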
[00:34:27] Tobias Macey:
In your experience of working with these technologies and dealing with the scale and complexity of Stripe, what are some of the most interesting or innovative or unexpected ways that you have seen Iceberg and/or Trino used, either in isolation or in combination?
[00:34:44] Kevin Liu:
Yeah. I think one of the most surprising things, and this is both good and bad, is that because we run Trino, and Trino has a very simple SQL interface, we ended up exposing an API layer where you can post some SQL and get a result back. Because of that, and the simplicity of integrating with it, once you integrate with that endpoint you're able to access the rest of the data ecosystem. We've seen a lot of use cases where, instead of properly integrating with the underlying data, people say, hey, why not just submit some SQL to this endpoint, and voila, you're in the ecosystem right away. And there's good and bad in that. Some of the things we've seen are integrating operational data: when something happens, write to the database, and on the other side constantly poll this endpoint to do some kind of aggregation and say, okay, if it's above this threshold, Slack me or page me.
So it almost goes into the realm of observability, what Datadog is doing: observability and alerting and all of that. It's a surprising use case, but I can see why it happens. We're trying to figure out: is this the best use for that kind of data? And if it's not, is there another way we can expose the same functionality in our platform, maybe using another technology or some other tooling to support that?
But once you open up a SQL endpoint to the whole world within Stripe, it's very fun to see what engineers come up with to get into that data ecosystem.
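[Editor's note: Stripe's internal endpoint isn't public, but Trino's own client REST protocol works the same way: you POST a statement, then follow a `nextUri` chain until the query finishes, accumulating rows along the way. Below is a simplified sketch of that polling loop with the HTTP transport injected, so it can run without a live cluster; the real protocol (headers, errors, spooling) has more to it.]

```python
# Simplified sketch of the Trino-style statement polling protocol:
# submit SQL, then follow each response's "nextUri" until it
# disappears, collecting any "data" rows along the way. The HTTP
# transport is injected as two callables so this runs without a
# real cluster; a real client would POST /v1/statement and GET
# each nextUri over HTTP.

def run_statement(submit, fetch, sql: str) -> list:
    response = submit(sql)          # e.g. POST /v1/statement
    rows = []
    while True:
        rows.extend(response.get("data", []))
        next_uri = response.get("nextUri")
        if next_uri is None:        # no nextUri: the query is finished
            return rows
        response = fetch(next_uri)  # e.g. GET nextUri
```

This shape is exactly why "just post some SQL" is such an attractive integration point: any client that can speak HTTP and JSON can join the data ecosystem.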
[00:36:57] Tobias Macey:
Going back to the performance piece, I'm wondering what some of the transition points are, as far as use cases where latency becomes problematic and people try to pull data into other systems for caching, or ask for a different storage engine or storage system to reduce some of those latencies, and how much latency people are willing to deal with, even when maybe they shouldn't.
[00:37:24] Kevin Liu:
Right. Yeah. That question on latency is a very interesting one for us running the infra to answer. Sometimes you run a SQL query, and because it's ad hoc analytics, let's focus on that use case, you expect it to come back relatively fast. I don't want to run a query, go get a coffee, and come back for it to just be finishing. People have this expectation that it should be fast. But when you peel back the layers and look at what some of those queries are doing, it's querying petabytes of data and doing some kind of filtering on that. If you look at it from that perspective: wow, that query finished within 20 seconds. That's amazing. How did they ever do that?
And this is where Iceberg comes in to help out. Iceberg provides us the tooling to do some kind of optimization, through partitioning, or sorting, or just metadata pruning. It helps, but at the end of the day you still have to pay the compute cost to get the data that you need. So a lot of the ideas from our end are to say, well, can we actually tell our users, hey, you're querying a lot of data and doing a lot of compute, and you shouldn't expect it to finish within 5 or 10 seconds. A lot of this comes up as people coming to us saying, hey, is this thing slow today? Is our infrastructure slow today? Because this used to run in 10 seconds and now I'm waiting 30 seconds. And there's a multitude of reasons why that happens. Maybe the query is being queued because the cluster is busy.
Maybe the underlying data grew. Maybe it wasn't that much before, and now it's grown 10 times, so if you're doing joins it just explodes. We helped a little bit by providing some kind of progress bar. There's a difference between watching a spinner go around indefinitely versus a progress bar that actually shows you the progress. That was our initial answer: hey, it's doing work in the background, it's just not done yet. And that helps a little bit. The second thing is showing, hey, this is how much work you're doing, this is how much compute you're using. In CPU hours, it took 20 different machines this amount of time to finish your query.
So it resets the user expectation from "why is it so slow?" to "oh my god, it's doing that much work and it finished that fast. That's so cool."
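[Editor's note: the metadata pruning mentioned above is a big part of why a petabyte-scale filter can come back in seconds. Iceberg keeps per-file column statistics in its manifests, so the engine can skip whole files whose value ranges cannot match the predicate, without reading them. A toy version of that file-skipping logic, with invented manifest entries:]

```python
# Toy model of Iceberg-style metadata pruning: each data file carries
# min/max statistics for a column, and files whose range cannot
# satisfy an equality predicate are skipped without being read.
# The manifest entries below are invented for illustration; real
# Iceberg manifests carry much richer per-file metadata.

def prune(files: list, column: str, value: int) -> list:
    """Return paths of files that might contain rows where column == value."""
    kept = []
    for f in files:
        lo, hi = f["stats"][column]
        if lo <= value <= hi:      # range overlaps: file must be scanned
            kept.append(f["path"])
    return kept                    # everything else is skipped entirely
```

With well-chosen partitioning and sort order, most files fall outside the predicate's range, so the engine only pays compute for the files that could actually match.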
[00:40:42] Tobias Macey:
And in your work of building these systems using Trino and Iceberg, digging into the technologies and the community, what are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:40:54] Kevin Liu:
I think the most interesting, and this is one I'm really thankful for, is the open source community around both Trino and Iceberg and how supportive people are, both the committers working on the projects and the general community surrounding them. Contributing, asking questions, and collaborating with both communities has been really helpful and really easy. Through my work, I've contributed some of the pieces that we work on and helped evangelize some of the things we can do with them, for example the things we talked about with reading query metadata and all of that kind of metadata. I gave a talk at Trino Fest a couple of years ago, and a bunch of people reached out to me: hey, how did you do that? How can we do that?
And just meeting other folks from other companies, like Lyft, Pinterest, Quora, people who are running Trino and Iceberg at scale, taking learnings from them, and sharing notes on what problems you're facing, what problems we're facing, and how each of us is solving them, has been really great. Shout out to some of the folks in the community, like Manfred, who I met recently in Seattle and who signed a copy of Trino: The Definitive Guide for me, which is awesome. And Brian Olsen, who wrote the initial "Trino on Ice" blog series that I took a lot of inspiration from and that welcomed me into this world of using both technologies together.
But to summarize, it's been really fun collaborating with both the contributors and the people working at different companies.
[00:42:54] Tobias Macey:
And for people who are working in the space of dealing with data, particularly if they're already relying on S3, or maybe just considering whether to go that direction, what are the cases where you see Trino and/or Iceberg as the wrong choice, or the lakehouse as an architectural component as maybe the wrong choice for a given use case?
[00:43:18] Kevin Liu:
Yeah. The way I see the lakehouse, this whole architecture, and specifically an implementation like Trino and Iceberg, is that this ecosystem and these technologies are kind of a reimplementation of what a database is. If you look at Postgres, take it apart, and ask what all the components are, you can map them roughly one to one onto the current lakehouse stack. It just breaks apart the database idea. But if a database is all you need, if you can run your entire stack on a Postgres, on an Aurora, on whatever, and you don't have any bottlenecks with that, great; that's all you should do. If you run into issues where running analytics slows down your operations, your inserts, because you're sharing the same database for both analytical and operational workloads, maybe then you can start looking into this realm of the lakehouse.
And especially if you're already on blob storage, it's just so easy to read from blob storage using Trino and Iceberg that it's honestly the best way to analyze data in that realm. Specifically, if you're on S3 and you're using Hive, go look at your S3 bill for LIST requests and how much money it's costing you to do LISTs, then give Iceberg a try and see how much of that bill you save from that one operation alone. From there, you can move into more and more of the feature set within this ecosystem. If you're facing problems with file caching, Trino has a caching feature built on Alluxio.
If you're facing issues with, say, a table format, where you have a bunch of Parquet files you want to analyze and put into a table format, you can use Trino to put them into Iceberg, into Hudi, into Delta. Trino supports all three formats. You're able to read whatever data you want and then write to those formats. So there are a lot of different use cases and a lot of different tools involved, but these technologies give you the tools to figure out exactly what you need, exactly what your bottleneck is, and to solve for those.
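[Editor's note: the S3 bill point is concrete. With Hive-style tables, planning a query typically issues a LIST per partition directory, while Iceberg tracks every file path in manifests, so planning costs a handful of GETs regardless of partition count. A back-of-the-envelope comparison, using placeholder request prices that you should replace with your own S3 pricing:]

```python
# Back-of-the-envelope comparison of object-store request costs for
# query planning. Prices are illustrative placeholders, NOT current
# S3 pricing; plug in your own numbers.

LIST_PRICE = 5.00 / 1_000_000   # assumed $ per LIST request
GET_PRICE = 0.40 / 1_000_000    # assumed $ per GET request

def hive_planning_cost(partitions: int, queries: int) -> float:
    # Hive-style planning: one LIST per partition directory per query.
    return partitions * queries * LIST_PRICE

def iceberg_planning_cost(manifest_files: int, queries: int) -> float:
    # Iceberg-style planning: read a few manifest files via GET per
    # query, independent of how many partitions the table has.
    return manifest_files * queries * GET_PRICE
```

For a table with tens of thousands of partitions queried thousands of times a day, the LIST term dominates, which is the line item Kevin suggests checking.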
[00:46:11] Tobias Macey:
As you continue to rely on and invest in the Trino and Iceberg infrastructure at Stripe, what are some of the capabilities or integrations or projects that you're excited to dig into?
[00:46:24] Kevin Liu:
Yeah. For me personally, one is using Trino on the write path. I think there's exploration there, and we've seen companies like Lyft and Pinterest already use that for some of their use cases, especially with this thing called Project Tardigrade, where Trino is now resilient to partial query failure. Before Tardigrade, when something went wrong, the entire query failed and you just had to rerun it. That was by design, because Trino was designed to be very fast, in memory. With Project Tardigrade, there is checkpointing, so even if something fails, your query will continue running. It almost looks like a Spark engine, where you say, well, I'll just run some query and wait for the result to come back to me.
Because of that, I think there are some use cases we can explore for writing; there are certain use cases that are probably better for Trino to write than Spark. So that's one of them. Another one I'm interested in exploring is more on the table format and data lake side of things, for data sharing: how do we efficiently share data across platforms, across engines, across clouds, so that I can take one piece of data and make it available everywhere? That's more on the Iceberg side and the REST catalog side. But Trino, as an engine that supports every major cloud storage and every major table format, can be helpful in this equation, because it's just a really powerful tool for whatever you want to do in data land.
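[Editor's note: Project Tardigrade became Trino's fault-tolerant execution mode, where intermediate results are spooled so that a failed task can be retried instead of failing the whole query. A toy model of the difference, with invented task names and logic, not Trino code:]

```python
# Toy model of task-level retry, the idea behind Trino's
# fault-tolerant execution mode: instead of failing the whole query
# when one task dies, only the failed task is rerun (its inputs
# having been checkpointed/spooled). Invented simulation, not Trino.

def run_with_task_retry(tasks, attempt, max_retries=3):
    """Run each task; retry only the failed task, not the whole query.

    `attempt(task, retry)` returns True on success. Without retry
    semantics, any single False would abort the entire run.
    """
    completed = []
    for task in tasks:
        for retry in range(max_retries + 1):
            if attempt(task, retry):
                completed.append(task)
                break
        else:
            raise RuntimeError(f"task {task} failed after {max_retries} retries")
    return completed
```

This is also why Trino in this mode "almost looks like a Spark engine": long-running writes can survive individual worker failures rather than restarting from scratch.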
[00:48:26] Tobias Macey:
Are there any aspects of Trino, Iceberg, and your experience of working with them that we didn't discuss yet that you'd like to cover before we close out the show?
[00:48:36] Kevin Liu:
I think everyone should be on the REST catalog, and that there's almost, I would say, almost no reason for you not to be, though that's kind of a bold statement. There are a lot of interesting features where, once you're on the REST catalog, you can start adding more and more of them, and it gives you, the platform engineer, the flexibility to keep adding. And the REST catalog is compatible with whatever catalog you currently have. If you have Glue, if you have Hive, if you have JDBC, Postgres, whatever, you're able to connect the REST catalog to that, use the REST catalog as the interface, and plug that into Spark, into Trino, into Athena, Starburst; whatever engine or vendor you have supports this.
So now this is the centralized piece for everything, and once you're on it, you're able to move wherever you want; you just have that flexibility at the platform level. But there's still a lot of active development in this area, in open source as well. So if anyone's interested, contact me, contact the open source community. I think there's a lot of innovation in this space.
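[Editor's note: for concreteness, pointing an engine at a REST catalog is typically a small piece of configuration. A sketch of a Trino catalog properties file using the Iceberg connector's `rest` catalog type; the property names follow the Trino Iceberg connector documentation, but the catalog name and URI are hypothetical, and you should verify the properties against the current docs:]

```properties
# etc/catalog/lakehouse.properties -- hypothetical catalog name
connector.name=iceberg
# Use an Iceberg REST catalog instead of talking to Hive/Glue/JDBC directly
iceberg.catalog.type=rest
# Hypothetical endpoint; point this at your REST catalog service
iceberg.rest-catalog.uri=https://rest-catalog.example.com
```

The same REST endpoint can then be configured in Spark, PyIceberg, or a vendor engine, which is the "centralized piece" being described.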
[00:50:03] Tobias Macey:
Absolutely. Well, for anybody who wants to get in touch with you, follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:20] Kevin Liu:
Yeah. My perspective has always been reimplementing what we had in the database world into this new cloud, disaggregated, separation-of-compute-and-storage architecture, where you have specialized technology for everything that was previously inside the database. So the REST catalog becomes the data catalog; compute is everywhere, and if you can bring your data to whatever compute you like, you can use whatever compute you like. Then you can optimize your data and your file format into a table format. With Iceberg, a lot of these interesting features are implemented on top of the table format: indexes, security, encryption.
Once you're in this realm, the possibilities are endless, and there's a lot of innovation right now in this world, so it gets me really excited to work in it. And on the Trino side, they're always implementing new things and coming up with new features; the engine just gets much, much better throughout the years. So whatever technology we use, Trino itself, as a piece that can plug into other ecosystems, will always be useful as a tool at the end of the day.
[00:51:56] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your experiences working with Trino and Iceberg, and some of the interesting applications and combinations that you've been able to build with them. It's definitely great to hear from people who are digging deep into these systems and understanding some of the new and interesting ways that they can be applied. I appreciate your investment in making the REST catalog the preferred means of interacting with Iceberg; I agree with you on that. So thank you again for taking the time, and I hope you enjoy the rest of your day.
[00:52:31] Kevin Liu:
Of course. Thanks for having me.
[00:52:38] Tobias Macey:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview
Guest Introduction: Kevin Liu from Stripe
Getting Started with Data Engineering
Stripe's Use of Trino and Iceberg
Challenges and Bottlenecks at Stripe
Infrastructure and Migration Strategies
REST Catalog Implementation
Use Cases and Analytics
Managing Multi-Tool Ecosystems
Unexpected Use Cases and Performance
Lessons Learned and Community Support
When Trino and Iceberg Are Not the Right Choice
Future Projects and Integrations
Final Thoughts and Recommendations