Summary
Stripe is a company that relies on data to power their products and business. To support that functionality they have invested in Trino and Iceberg for their analytical workloads. In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'm interviewing Kevin Liu about his use of Trino and Iceberg for Stripe's data lakehouse
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what role Trino and Iceberg play in Stripe's data architecture?
- What are the ways in which your job responsibilities intersect with Stripe's lakehouse infrastructure?
- What were the requirements and selection criteria that led to the selection of that combination of technologies?
- What are the other systems that feed into and rely on the Trino/Iceberg service?
- What kinds of questions are you answering with table metadata?
- What use cases and teams does that support?
- What is the comparative utility of the Iceberg REST catalog?
- What are the shortcomings of Trino and Iceberg?
- What are the most interesting, innovative, or unexpected ways that you have seen Iceberg/Trino used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Stripe's data infrastructure?
- When is a lakehouse on Trino/Iceberg the wrong choice?
- What do you have planned for the future of Trino and Iceberg at Stripe?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- Trino
- Iceberg
- Stripe
- Spark
- Redshift
- Hive Metastore
- Python Iceberg
- Python Iceberg REST Catalog
- Trino Metadata Table
- Flink
- Tabular
- Delta Table
- Databricks Unity Catalog
- Starburst
- AWS Athena
- Kevin's Trino Fest Presentation
- Alluxio
- Parquet
- Hudi
- Trino Project Tardigrade
- Trino On Ice
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. Go to [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for. Starburst has complete support for all table formats, including Apache Iceberg, Hive, and Delta Lake. And Starburst is trusted by teams of all sizes, including Comcast and DoorDash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst today and get $500 in credits to try Starburst Galaxy, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey, and today I'm interviewing Kevin Liu about his use of Trino and Iceberg for Stripe's data lakehouse. So, Kevin, can you start by introducing yourself?
[00:01:01] Kevin Liu:
Of course. Hey, everyone. My name is Kevin Liu. I'm currently a software engineer at Stripe. For the past three years, I've been working in the big data infrastructure ecosystem, primarily with Trino and Iceberg. Recently, I've taken on a new challenge working in data sharing, which is really awesome. I'm here to talk about use cases for Trino and Iceberg.
[00:01:25] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:28] Kevin Liu:
I just stumbled upon it, honestly. I started a new job at Stripe, was put onto a team working with Trino and Iceberg, and had no idea what they were. From there it was a lot of learning, a lot of contributing to open source, working with the community, and getting to know the technology.
[00:01:50] Tobias Macey:
And in that context of Trino and Iceberg, the ways that it's being applied at Stripe, I'm wondering if you can just start by giving a bit of an overview about how it's being used, its overall position and responsibilities within the broader data architecture of Stripe, and some of the initial experiences that you had getting onboarded into that ecosystem?
[00:02:13] Kevin Liu:
Mhmm. Yeah. A lot of the context is from before my time, so I only know it from reading and from historical context, but I can give you an overview of what it looks like today. Trino and Iceberg are what we use for the majority of business analytics, data analytics, dashboarding, anything to do with reading big datasets. A lot of transformation and a lot of dashboarding and presentation is done at the Trino layer. And there's a clear distinction between our use of Trino and our use of something like Spark. Spark is used for transformation, for writing and transforming big datasets, versus our current use case for Trino, which is reading. That's the big distinction.

Historically, we actually moved away from Redshift to Trino. We had reached a scale where we needed to outgrow Redshift, and Trino's distributed nature, the fact that you can scale up and scale out Trino clusters and have them work on your big datasets, really helped us scale the organization and our needs for business intelligence and data analytics.

So that's the context behind it. Currently, everyone who is using any kind of data and doing any kind of analytics at Stripe uses Trino to power that on the back end.
[00:03:57] Tobias Macey:
Given the very read heavy nature, and I know that there has been a lot of work in recent years put into being able to optimize for things like query caching, read speeds, etcetera. I know that Iceberg helps in that, but I'm curious what are some of the edge cases that you've run into and that you've seen other people run into in that very analytics heavy, read heavy environment of being able to query across these large datasets. And maybe if you're able to give some sense of what large means in your context and just some of the ways that you've started to hit against some of those limitations in your experience.
[00:04:36] Kevin Liu:
Yeah. Let me start with how big the data is and what bottlenecks we were facing at Stripe. The entire Stripe ecosystem of data is massive, petabytes of data or more. And doing analytics on that, joining data, making sure we can do reporting and operational analytics, a lot of that is powered by Trino, by many clusters running many, many machines.

One of the bottlenecks we faced was actually the concurrency of running queries. It's really popular for people at Stripe to pop into an internal website, write some SQL query against our big data store, and find some result from that. It has gotten to the point where it's almost a central repository: if you want data to be easily accessible, you dump it into this ecosystem so everyone else can go to one centralized website and query for it. From what I've read, that concurrency was the bottleneck for Redshift. When you have thousands of people all trying to query at once, that's something Redshift wasn't good at at the time.

Because we can scale Trino out, with multiple clusters and multiple machines backing that compute, and because Trino is made for this kind of fast, ad hoc analytics, it worked well in this realm. Another bottleneck we faced was the whole Hive versus Iceberg question. As with any organization that started five, eight, ten years ago, Hive was the main aspect of the data platform, especially in big data. Everyone writes to Hive. Spark writes to Hive. Trino used to have the Hive connector as its initial connector.

But eventually, everyone moved from Hive to Iceberg, and there's a clear reason for that. One of them, and people repeat this over and over again, is how Hive handles partitioning and the fact that you have to list files in a blob store like S3. Hive was created in a different paradigm. It was made for Hadoop, which has a more efficient list operation, and with Hive on S3 and blob stores, listing becomes very expensive. A very real concern when you're working with big data is partitioning and having to list all of the partitions.

When you do multiple levels of partitioning, say I want to partition by a timestamp plus some kind of clustering ID, it just blows up the amount of listing you have to do. First of all, listing in S3 is really expensive; S3 list calls really add up in terms of cost. Secondly, it's just not efficient. If you have a thousand files and you have to list them every time, it becomes really tedious, memory intensive, and slow.

And when you're using the Hive ecosystem, a lot of the time you have the Hive Metastore as the metadata layer, and the Hive Metastore, when you're listing thousands of files, can really just blow up on you, which is not fun as a big data engineer. A lot of the reasons we picked and ended up with Trino and Iceberg have to do with these underlying bottlenecks, and these technologies really helped us solve for them.
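To make the listing blow-up concrete, here is a back-of-the-envelope sketch. All the partition counts are made up for illustration; the point is that with Hive-style directory layouts, each partition level multiplies the number of object-store LIST calls a planner has to make.

```python
# Hypothetical partition scheme, to illustrate how nested Hive-style
# partitioning multiplies object-store LIST calls. Counts are invented.
days = 365          # one year of timestamp partitions (assumed)
cluster_ids = 200   # second partition level (assumed)

# Hive layout: s3://bucket/table/ds=.../cluster_id=.../file.parquet
# Planning a full scan needs one LIST to discover the ds= directories,
# then one LIST per leaf directory to find the data files.
list_calls = days + days * cluster_ids
print(list_calls)  # -> 73365

# Iceberg instead records every data file in manifest files, so planning
# reads a handful of manifests regardless of partition fan-out.
```

The exact accounting varies by engine, but the multiplicative shape is why adding a second partition column to a Hive table can make S3 listing both slow and expensive.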
[00:09:08] Tobias Macey:
Digging a bit more into the specifics of your infrastructure, it sounds like you are using S3 or some analogous blob store as the storage layer, and obviously you're using Iceberg as the table format. I'm wondering if you are using the Hive Metastore as the means of storing the pointers to the appropriate metadata, or have you moved to the REST catalog functionality that Iceberg has been adding recently? And what are some of the ways that you think about that decision?
[00:09:39] Kevin Liu:
Yeah. As a very general overview, as I already alluded to: Spark for writing and transformation, Trino for reading the data. On the data file side, we use a blob store, S3, to store the actual data. But on the table format side, we're actually kind of stuck in between Hive land and Iceberg land. We use both, and we're still slowly migrating from Hive to Iceberg, which means that for both Spark and Trino we have to support both, because some tables are in Hive format and some tables are in Iceberg format.

There's a slow transition to move everything into Iceberg. Both Hive and Iceberg require a metadata catalog, and we started with the Hive Metastore; that's the thing Hive requires you to run. And actually, the initial version of Iceberg can run off the Hive Metastore, because when it was first implemented, it was a direct replacement for Hive. So we're still using the Hive Metastore, but there's a slow migration: how do we get out of the Hive Metastore? How do we get out of the Hive table format? And as you alluded to, I think the answer is the REST catalog.

I actually wrote some articles on this, basically saying that even if you currently have the Hive Metastore, you can add a REST catalog component on top of it, so there's a level of indirection between your compute and your catalog. As of now, Trino, Spark, and a lot of other engines all support the REST catalog. From a big data engineer's perspective, the REST catalog really helps decouple the engine from the underlying catalog and table format, and it allows us to do the migration transparently, without our users knowing what's going on in the background. It gives us a tool to do all of these migrations, improvements, and optimizations.
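The indirection Kevin describes can be sketched in a few lines: engines never hardcode where a table's metadata lives, they ask the catalog. This is a toy in-memory stand-in, not the real Iceberg REST API or Stripe's implementation; all names are invented.

```python
# Toy illustration of catalog indirection: engines resolve a table name
# to an Iceberg metadata location through the catalog, so the backing
# store (Hive Metastore today, something else tomorrow) can be swapped
# without the engines noticing. Everything here is made up.

class ToyCatalog:
    """Maps table identifiers to Iceberg metadata file locations."""

    def __init__(self):
        self._tables = {}

    def register_table(self, identifier, metadata_location):
        self._tables[identifier] = metadata_location

    def load_table(self, identifier):
        # Any engine (Trino, Spark, Flink) calls this before a read.
        return self._tables[identifier]


catalog = ToyCatalog()
catalog.register_table(
    "analytics.payments", "s3://bucket/payments/metadata/v42.json"
)

# Migrating the table only means updating the catalog entry,
# not reconfiguring every engine that reads it.
location = catalog.load_table("analytics.payments")
print(location)
```

The real REST catalog adds namespaces, commits, and authentication on top, but the decoupling benefit comes from exactly this single point of resolution.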
[00:12:12] Tobias Macey:
And I was noticing recently that you are at least involved in, if not the author of, the Python Iceberg REST catalog. I'm wondering what some of your inspiration or motivation was to put in that work, and some of the ways that you're hoping to see it used?
[00:12:31] Kevin Liu:
Yeah. I'm really excited about the REST catalog, honestly, because from an industry standpoint, everyone has standardized on this format. You can go to Spark, you can go to Snowflake, you can go to any engine or vendor, and they have a plugin for the REST catalog. Which means there's a standard now where you can say, I have this REST catalog, and I can take it and go wherever. That really helps with avoiding vendor lock-in, making it very easy to transition from one platform to another. And it levels the playing field for what a vendor should provide: if it's easy for me to take my data and go somewhere else, my current vendor had better give me the best available technology and optimizations, because if someone else is better, I can just go there. Coming back to the Python Iceberg REST catalog implementation, that's been on my mind for quite a while, because for the majority of my time at Stripe I've been dealing with the Hive Metastore, making sure it doesn't blow up, looking at what it's doing in the ecosystem.

And I really think this REST catalog idea is a direct replacement, if not more, for the Hive Metastore. There are a lot of interesting things you can do once everyone is on this kind of technology and implementation.
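Part of why a REST catalog is easy to standardize on is how small the core surface is. This sketch mimics the shape of the spec's loadTable route (`GET /v1/{prefix}/namespaces/{ns}/tables/{table}`) with an in-memory dispatcher; the stored responses are simplified illustrations, not the full spec payloads.

```python
# Minimal sketch of the request/response shape a REST catalog answers.
# Route pattern follows the Iceberg REST spec's loadTable endpoint;
# storage and response bodies here are simplified stand-ins.
import re

TABLES = {
    ("analytics", "payments"): {
        "metadata-location": "s3://bucket/payments/metadata/v42.json",
        "metadata": {"format-version": 2},
    }
}

LOAD_TABLE = re.compile(r"^/v1/namespaces/(?P<ns>[^/]+)/tables/(?P<tbl>[^/]+)$")

def handle_get(path):
    """Dispatch a GET path to a catalog response (toy version)."""
    m = LOAD_TABLE.match(path)
    if m:
        key = (m.group("ns"), m.group("tbl"))
        if key in TABLES:
            return 200, TABLES[key]
        return 404, {"error": "NoSuchTableException"}
    return 404, {"error": "unknown route"}

status, body = handle_get("/v1/namespaces/analytics/tables/payments")
print(status, body["metadata-location"])
```

Because the contract is just HTTP plus a published OpenAPI spec, any language can implement it, which is what makes a Python implementation practical in the first place.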
[00:14:12] Tobias Macey:
Moving back into some of the work that you're doing at Stripe, the ways that you are taking advantage of the Trino and Iceberg combination. I'm wondering what are some of the projects that you have been most excited about or most interested by in your work at Stripe and some of the specifics of Iceberg, Trino, the combination thereof that have led you to being able to employ them beyond just the very simple, I have my data somewhere. I can query it.
[00:14:44] Kevin Liu:
Right. Yeah. I think the biggest thing that I'm really proud of is utilizing these two technologies and, how do you say it, syncing them in a way where it's not just: Trino queries Iceberg and gets the result. Trino itself is a really powerful engine. It allows you to read from whatever data source you can connect to. You can say, I have a Postgres database, I can connect that to Trino, and magically you can now write SQL against a Postgres database somewhere else. So the thing I'm really proud of at Stripe is using this ecosystem and plugging things all the way back into Trino. I'll give you a concrete example. I talked about the Hive Metastore.

I talked about Iceberg. Iceberg itself has a lot of metadata. Because it's a table format, it can store more than just the underlying data. It can give you the details on the partition scheme, on the schema, metadata about how big your table is, the distribution of your table, the min and max, all the stats. And all of this metadata in Iceberg land lives in metadata files. As a big data engineer trying to figure out what's going on with a specific Iceberg table, it was really difficult to dig in and do diagnostics on it. So one of the things I created at Stripe was to plug that metadata ecosystem back into Trino.

Then I can write SQL against it and diagnose my Iceberg table using Trino. There are two aspects to this. One is that Trino has this idea of metadata tables, where instead of reading the data itself, you can read the metadata of the table format. For things like partitioning, I can just select everything from the table name plus a dollar sign and "partitions", and now instead of reading the data, it gives me partition data back. It comes back in tabular form, so I can join it if I want, aggregate it, and perform any kind of analytics. That's one piece: you can plug the metadata back into Trino itself. The second piece is the catalog, the Hive Metastore. The Hive Metastore is essentially a catalog of all of the data and tables registered at Stripe, everything you're able to read at Stripe.
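In Trino, those metadata tables are queried like ordinary tables, for example `SELECT * FROM iceberg.analytics."payments$partitions"` (the quotes are needed because of the dollar sign). The sketch below uses invented rows shaped roughly like that query's output to show the kind of diagnostic you can run once partition metadata is just a table.

```python
# Toy stand-in for rows from a Trino "$partitions" metadata table, e.g.
#   SELECT partition, record_count, file_count
#   FROM iceberg.analytics."payments$partitions"
# The rows and threshold below are invented for illustration.

partitions = [
    {"partition": "ds=2024-05-01", "record_count": 1_200_000, "file_count": 40},
    {"partition": "ds=2024-05-02", "record_count": 900_000, "file_count": 310},
    {"partition": "ds=2024-05-03", "record_count": 1_100_000, "file_count": 35},
]

# A common diagnostic: partitions with many files but few records per
# file have a small-files problem and are compaction candidates.
compaction_candidates = [
    p["partition"]
    for p in partitions
    if p["record_count"] / p["file_count"] < 10_000
]
print(compaction_candidates)  # -> ['ds=2024-05-02']
```

In practice you would express the filter directly in SQL against the `$partitions` table; the Python here just makes the aggregation logic explicit.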
And the idea is: how do we analyze that catalog data so there's an easier way to figure out what's going on in our ecosystem? For example, the Hive Metastore itself is backed by an underlying database; for us, that's Postgres. I want to read that Postgres data and figure out, say, how many tables are there in total at Stripe? How many Hive tables? How many Iceberg tables? So what we did was plug that underlying database itself back into Trino. Now I can query it to ask: what is this table?

When was it last updated? That metadata is all in the database, and by plugging it directly into Trino, I can query it in real time, because underneath, Trino is issuing a query directly to that database. Mind you, you should set up a read-only database if you're doing that against your backing store. And the third piece is, we took these two ideas, looked at them, and said, okay, this is great, we have a pretty nice infrastructure component for plugging things back into Trino. What can we do with that?
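The metastore-database query described above can be sketched with an in-memory stand-in, sqlite here in place of Postgres. The real Hive Metastore schema (tables like `TBLS` and `TABLE_PARAMS`) is considerably more involved; this simplified single table keeps only what the example needs.

```python
# Toy version of "query the metastore's backing database with SQL".
# Schema and rows are simplified inventions, not the real HMS schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbls (name TEXT, table_format TEXT)")
conn.executemany(
    "INSERT INTO tbls VALUES (?, ?)",
    [("payments", "iceberg"), ("charges", "iceberg"), ("legacy_events", "hive")],
)

# The kind of question described: how far along is the Hive-to-Iceberg
# migration, by table count?
counts = dict(
    conn.execute("SELECT table_format, COUNT(*) FROM tbls GROUP BY table_format")
)
print(counts)  # -> {'hive': 1, 'iceberg': 2}
```

With the database connected through Trino's Postgres connector, the same `GROUP BY` would run as live SQL against the (read-only) metastore backing store.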
We're running this infrastructure essentially to provide SQL query capability at Stripe, and a big thing we were facing at the time was analytics on what is being run on Trino: how much CPU it's using, how much memory it's using, are we blowing up a cluster, are we optimizing the cluster? So the idea is to log each query that's run on Trino. Trino has a nice connector that gives us that information, along with a lot of usage information from Trino itself, like how much memory and CPU each query used.

So what we did is we took that information from the connector and dumped it into another database, a Postgres database. Every time a query is run, we write a row with the information about that query into our Postgres database. And then we plug that back into Trino. So now, on Trino, you can ask: in the last five minutes, how many queries were run? Something we're interested in specifically: was there a spike in how many queries were submitted in the last five minutes?

Now you have near-real-time analytics on everything that runs on your platform. That's really helpful from an infrastructure standpoint, because it gives us more observability, and it gives us a component in our platform that we can keep adding information to. For example, we open Trino up to services within Stripe as well, so now we can ask: what is the resource usage of this specific service, and what kind of queries is it running? It's a nice component that we can keep adding metadata to and use for analytics.
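The query-log loop above reduces to something very simple once each finished query is a row: "was there a spike in the last five minutes?" is a filter plus a count. The rows, timestamps, and baseline below are invented for illustration.

```python
# Toy version of near-real-time analytics over a Trino query log.
# One dict per completed query; values are made up.
from datetime import datetime, timedelta

now = datetime(2024, 6, 1, 12, 0, 0)

query_log = [
    {"user": "svc-reporting", "submitted": now - timedelta(minutes=1)},
    {"user": "svc-reporting", "submitted": now - timedelta(minutes=2)},
    {"user": "alice",         "submitted": now - timedelta(minutes=4)},
    {"user": "bob",           "submitted": now - timedelta(minutes=30)},
]

recent = [q for q in query_log if now - q["submitted"] <= timedelta(minutes=5)]
print(len(recent))  # -> 3

# A crude spike check against an assumed baseline rate:
baseline_per_5min = 2
spike = len(recent) > baseline_per_5min
print(spike)  # -> True
```

In the real setup this would be a SQL query over the Postgres log table, issued through Trino itself, which is what makes the observability loop self-hosting.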
[00:21:14] Tobias Macey:
And with that meta capability, the meta cognition, if you will, about your data platform being able to round trip all of this information through that system, I'm wondering what are the types of use cases that that supports and some of the types of questions that people are asking and answering about the data platform itself using those capabilities?
[00:21:38] Kevin Liu:
Yeah. A big one, and we always get asked this, is: a data producer now has a v2 or upgraded version of a table, and the main question is, hey, I want to deprecate the old one, but I don't want to break everything. Can I go figure out who's using this table, either to tell them there's a new one, or just to make sure that when I do deprecate it, it's not going to break some obscure workstream? For us, being able to analyze every query that was run on the platform means we can see exactly which tables were used. That's a capability Trino gives us: for this query, which tables were queried?

Once we expose that information, that use case becomes super easy to achieve. I can just select from the query database where the affected table is this thing. And with the metadata I talked about before that's included with each query, we can actually point back to where the query originated. It could be that this service is running this query against this table, or these people are querying it, or this dashboard is querying this table.

So now you have a pointer back to the original use case, and you can say, okay, let me email everyone to say, hey, we're deprecating this. That's a very common use case for us: table deprecation, table transition. And there are a few more on the infra side, like rate limiting, making sure people are using their fair share of CPU and memory, and making sure people's queries are optimized: please include a partition filter so you don't read ten years of data.

Those are the kinds of things where we can use historical query information to help us make decisions for current use cases.
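The deprecation workflow is essentially a scan over the query log's referenced-table lists. A toy version, with invented log rows and origin labels:

```python
# Toy version of "who still reads this table?" before deprecating it.
# Log rows and origin labels are invented for illustration.
query_log = [
    {"origin": "dashboard:revenue", "tables": ["analytics.payments_v1"]},
    {"origin": "svc-reporting",     "tables": ["analytics.payments_v2"]},
    {"origin": "user:alice",        "tables": ["analytics.payments_v1",
                                               "analytics.refunds"]},
]

def consumers_of(table, log):
    """Everyone who referenced `table`, so they can be notified."""
    return sorted({q["origin"] for q in log if table in q["tables"]})

print(consumers_of("analytics.payments_v1", query_log))
# -> ['dashboard:revenue', 'user:alice']
```

The same structure answers the infra-side questions too: group by origin instead of table and you get per-service resource attribution.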
[00:24:10] Tobias Macey:
Because you have Trino as the read path and Spark as the write path, I'm sure there are other systems that may be able to hook into those Iceberg tables. What are some of the ways that you account for the fact that maybe not every read path, not every point of access to a particular data location, goes through Trino when determining: can I safely deprecate this table? Can I delete it? Can I stop feeding data into it? Just some of the ways that you are managing the multi-tool nature of the ecosystem, both at Stripe and in data more generally.
[00:24:52] Kevin Liu:
Yeah. Everything becomes hard once you start adding new tools. Flink is a very real use case for streaming and real-time analytics, and adding Flink to the mix adds an extra level of complexity to this ecosystem. So back to your question: with Spark on the write path and Trino on the read path, how do you know you're covering all the bases for the usage of one particular table? This is where I'm really excited about the REST catalog and this Iceberg notion of having a catalog, because you can really push down everything I just described. I say push down because I imagine Trino on top and the table formats below it; you can push all of those features down to the catalog level. Things like logging usage.

The catalog technically knows exactly when a table is used, because every engine goes through it. For Trino to read a table, it has to go to the catalog and grab that information before it can find where the table is. So we're really seeing the catalog as the centralized place when you have multiple engines. You can implement an engine-specific feature when you don't have many engines; I can implement everything I described once in Trino and again in Spark. But when you add Flink or yet another engine, you have to redo it all over again.

So the idea is that if we can push all of these features lower in the stack, the whole ecosystem becomes engine agnostic. Then we can plug in another engine and still get the same feature set without any changes to our ecosystem. We're not there yet, but that's the idea. And you see catalog vendors with similar ideas. Tabular was really into this idea of security at the catalog layer, so you don't have to reimplement security in every single engine with customized code for each one, because that's a lot of engineering effort and there are nuances to it. If you can lock it down at the catalog layer, then you can bring that catalog to whatever engine and get the same behavior.
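The push-down idea can be sketched as a decorator around the catalog: wrap it once, and every engine's table loads get recorded in one place. The classes and names here are invented; a real catalog would also log commits, authentication, and more.

```python
# Sketch of pushing usage logging down to the catalog layer: wrap the
# catalog so every load_table call is recorded, no matter which engine
# made it. All names here are invented for illustration.

class ToyCatalog:
    def __init__(self, tables):
        self._tables = tables

    def load_table(self, identifier):
        return self._tables[identifier]


class LoggingCatalog:
    """Decorates any catalog; engines talk to this instead."""

    def __init__(self, inner):
        self._inner = inner
        self.access_log = []

    def load_table(self, identifier):
        self.access_log.append(identifier)  # one place, every engine
        return self._inner.load_table(identifier)


catalog = LoggingCatalog(
    ToyCatalog({"analytics.payments": "s3://bucket/payments/metadata/v42.json"})
)
# Trino, Spark, and Flink would all resolve through the same wrapper:
catalog.load_table("analytics.payments")
catalog.load_table("analytics.payments")
print(catalog.access_log)  # -> ['analytics.payments', 'analytics.payments']
```

The same wrapping point is where catalog-level security or data-regionality checks would live, which is why centralizing on one catalog interface matters so much.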
[00:27:35] Tobias Macey:
In that future state, the catalog knows all of this information about the ways the data is being accessed. Coming back to the work you were doing on your REST catalog implementation, I imagine a lot of that information can be captured at the interface through which the catalog is queried. And I think that by moving to the REST API and away from the Hive Metastore, it becomes easier to put that logic into the REST API layer, maybe not uniformly across every implementation, but easier, so you can capture information about the access patterns: how often are they happening, and what are the sources these accesses are coming from?
[00:28:31] Kevin Liu:
Yeah. A lot of the feature sets and ideas for this catalog layer come from the Hive Metastore. The Hive Metastore contains a lot of features, and that's both good and bad. A multitude of features were implemented on the Hive Metastore because it was the centralized place that everyone runs, the meeting spot that everyone comes to. It has some bottlenecks in terms of performance, but the idea still holds true. With the Hive Metastore, we were writing customized code for security, for authorization, for, what's the word, data regionality.

Data has to live in a specific region, and we enforce that at the Hive Metastore layer. It's the same idea: now you just replace the Hive Metastore with the REST catalog. You have to be within the Iceberg ecosystem, but the REST catalog is in a similar vein, performing the same duty. It's the centralized place that every engine has to come to in order to do any kind of work. And in the Delta table world, Unity Catalog is doing the exact same thing. Unity Catalog is the piece of code you have to reach out to when you do any kind of write in the Delta world.

So I see the REST catalog as the basis. It's a spec with an implementation, and on top of that you can add a bunch of features to improve on the idea.
[00:30:23] Tobias Macey:
So we've talked a lot about the ways that Trino and Iceberg are useful, both in isolation and in combination. What are some of the aspects of those technologies that are still a pain point, and some of the ways you hope to see them evolve to improve the overall experience of working with them? Which features do you think should be pushed into one layer or the other, and which could be, but definitely shouldn't, because then you'd end up with a bloated monstrosity that does too many things?
[00:31:01] Kevin Liu:
Right. First of all, managing Trino is really hard, especially for a big organization with growing needs and growing bottlenecks. Managing the infrastructure itself becomes tedious, almost an operational job. That's one thing: we run open source Trino ourselves, so a lot of the operational burden is on the team. And because of how widely used it is and how critical it is to the daily operation of the company, it becomes critical for us to run this very well, with minimal downtime and very strict SLAs.
So that's the hard part. Maybe we'll explore using a vendor solution, something that will alleviate the maintenance burden for us. Starburst is one of them, Athena is another; there are other players here where we can say, okay, we'll handle the tooling and the ecosystem around it, but not running the underlying infrastructure. For us, it's just a black box: we run some compute nodes, we send SQL in, and results come out. So if we can delegate that somewhere else, I think it will alleviate a lot of the maintenance burden. The other thing, on Iceberg, is that managing Iceberg is still difficult because of this idea that you have to register the table in a catalog. There's metadata involved, and it's not as straightforward as just some files in a folder, the way Hive is. There's a level of abstraction in the metadata.
And when that breaks, it's a little bit harder to fix. This is where software engineering tooling comes in, to say, okay, if we want to restore a table to what it was before, we have to manage all this metadata, but we can use software to help us do that. Thinking back, another thing that has improved since I worked on Trino last year is the control plane for Trino. I know in the community there's a lot of movement for, I forget what it's called, but it's essentially a control plane where you're able to manage multiple Trino clusters simultaneously.
It handles routing for you, and it handles resource usage and everything. It's a part of the open source Trino community that I think is really helpful for any company managing more than one cluster. I'm really happy about that, and about seeing that open source contribution is still very active in that realm. I think we should explore something like that and use it in the future.
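[Editor's note: the project described here is likely the open-source Trino Gateway, a routing layer in front of multiple clusters. As a purely hypothetical sketch of the core idea (invented names and logic, not the Gateway's real code or API), a router only needs a rule for mapping an incoming query to a healthy, least-loaded cluster:]

```python
# Hypothetical sketch of a multi-cluster routing rule, in the spirit
# of a Trino control plane. All names here are invented for
# illustration; this is not the Trino Gateway's actual API.

from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    healthy: bool          # e.g. from a periodic health check
    running_queries: int   # e.g. from the cluster's stats endpoint

def route(clusters: list) -> Cluster:
    """Pick the healthy cluster with the fewest running queries."""
    candidates = [c for c in clusters if c.healthy]
    if not candidates:
        raise RuntimeError("no healthy Trino clusters available")
    return min(candidates, key=lambda c: c.running_queries)
```

A real control plane layers on query routing rules per user group, graceful cluster drain for upgrades, and resource accounting, but the load-aware selection above is the heart of it.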
[00:34:27] Tobias Macey:
In your experience of working with these technologies and dealing with the scale and complexity of Stripe, what are some of the most interesting or innovative or unexpected ways that you have seen Iceberg and/or Trino used, either in isolation or in combination?
[00:34:44] Kevin Liu:
Yeah. I think one of the most surprising things, and this is both good and bad, is that because we run Trino, and Trino has a very simple SQL interface, we ended up exposing an API layer where you can post some SQL and get a result back. Because of that, and the simplicity of integrating with it, once you integrate with that endpoint you're able to access the rest of the data ecosystem. We've seen a lot of use cases where, instead of properly integrating with the underlying data, people say, hey, why not just submit some SQL to this endpoint, and voila, you're in the ecosystem right away. And there's good and bad in that. Some of the things we've seen are integrating operational data: when something happens, write to the database, and on the other side constantly poll this endpoint to do some kind of aggregation and say, okay, if it's above this threshold, Slack me or page me.
So it almost goes into the realm of observability, what Datadog is doing: observability and alerting and all of that. It's a surprising use case, but I can see why it happens. We're trying to figure out: is this the best use for that kind of data? And if it's not, is there another way we can expose the same functionality in our platform, maybe using another technology or some other tooling to support that?
But once you open up a SQL endpoint to the whole world within Stripe, it's very fun to see what engineers come up with to get into that data ecosystem.
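[Editor's note: Stripe's internal endpoint isn't public, but Trino's own client REST protocol works the same way: you POST a statement, then follow a `nextUri` chain until the query finishes, accumulating rows along the way. Below is a simplified sketch of that polling loop with the HTTP transport injected, so it can run without a live cluster; the real protocol (headers, errors, spooling) has more to it.]

```python
# Simplified sketch of the Trino-style statement polling protocol:
# submit SQL, then follow each response's "nextUri" until it
# disappears, collecting any "data" rows along the way. The HTTP
# transport is injected as two callables so this runs without a
# real cluster; a real client would POST /v1/statement and GET
# each nextUri over HTTP.

def run_statement(submit, fetch, sql: str) -> list:
    response = submit(sql)          # e.g. POST /v1/statement
    rows = []
    while True:
        rows.extend(response.get("data", []))
        next_uri = response.get("nextUri")
        if next_uri is None:        # no nextUri: the query is finished
            return rows
        response = fetch(next_uri)  # e.g. GET nextUri
```

This shape is exactly why "just post some SQL" is such an attractive integration point: any client that can speak HTTP and JSON can join the data ecosystem.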
[00:36:57] Tobias Macey:
Going back to the performance piece, I'm wondering what some of the transition points are, as far as use cases where latency becomes problematic and people try to pull data into other systems for caching, or ask for a different storage engine or storage system to reduce some of those latencies, and how much latency people are willing to deal with, even when maybe they shouldn't.
[00:37:24] Kevin Liu:
Right. Yeah. That question on latency is a very interesting one for us running the infra to answer. Sometimes you run a SQL query, and because it's ad hoc analytics, let's focus on that use case, you expect it to come back relatively fast. I don't want to run a query, go get a coffee, and come back for it to just be finishing. People have this expectation that it should be fast. But when you peel back the layers and look at what some of those queries are doing, it's querying petabytes of data and doing some kind of filtering on that. If you look at it from that perspective: wow, that query finished within 20 seconds. That's amazing. How did they ever do that?
And this is where Iceberg comes in to help out. Iceberg provides us the tooling to do some kind of optimization, through partitioning, or sorting, or just metadata pruning. It helps, but at the end of the day you still have to pay the compute cost to get the data that you need. So a lot of the ideas from our end are to say, well, can we actually tell our users, hey, you're querying a lot of data and doing a lot of compute, and you shouldn't expect it to finish within 5 or 10 seconds. A lot of this comes up as people coming to us saying, hey, is this thing slow today? Is our infrastructure slow today? Because this used to run in 10 seconds and now I'm waiting 30 seconds. And there's a multitude of reasons why that happens. Maybe the query is being queued because the cluster is busy.
Maybe the underlying data grew. Maybe it wasn't that much before, and now it's grown 10 times, so if you're doing joins it just explodes. We helped a little bit by providing some kind of progress bar. There's a difference between watching a spinner go around indefinitely versus a progress bar that actually shows you the progress. That was our initial answer: hey, it's doing work in the background, it's just not done yet. And that helps a little bit. The second thing is showing, hey, this is how much work you're doing, this is how much compute you're using. In CPU hours, it took 20 different machines this amount of time to finish your query.
So it resets the user expectation from "why is it so slow?" to "oh my god, it's doing that much work and it finished that fast. That's so cool."
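[Editor's note: the metadata pruning mentioned above is a big part of why a petabyte-scale filter can come back in seconds. Iceberg keeps per-file column statistics in its manifests, so the engine can skip whole files whose value ranges cannot match the predicate, without reading them. A toy version of that file-skipping logic, with invented manifest entries:]

```python
# Toy model of Iceberg-style metadata pruning: each data file carries
# min/max statistics for a column, and files whose range cannot
# satisfy an equality predicate are skipped without being read.
# The manifest entries below are invented for illustration; real
# Iceberg manifests carry much richer per-file metadata.

def prune(files: list, column: str, value: int) -> list:
    """Return paths of files that might contain rows where column == value."""
    kept = []
    for f in files:
        lo, hi = f["stats"][column]
        if lo <= value <= hi:      # range overlaps: file must be scanned
            kept.append(f["path"])
    return kept                    # everything else is skipped entirely
```

With well-chosen partitioning and sort order, most files fall outside the predicate's range, so the engine only pays compute for the files that could actually match.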
[00:40:42] Tobias Macey:
And in your work of building these systems using Trino and Iceberg, digging into the technologies and the community, what are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:40:54] Kevin Liu:
I think the most interesting, and this is one I'm really thankful for, is the open source community around both Trino and Iceberg and how supportive people are, both the committers working on the projects and the general community surrounding them. Contributing, asking questions, and collaborating with both communities has been really helpful and really easy. Through my work, I've contributed some of the pieces that we work on and helped evangelize some of the things we can do with them, for example the things we talked about with reading query metadata and all of that kind of metadata. I gave a talk at Trino Fest a couple of years ago, and a bunch of people reached out to me: hey, how did you do that? How can we do that?
And just meeting other folks from other companies, like Lyft, Pinterest, Quora, people who are running Trino and Iceberg at scale, taking learnings from them, and sharing notes on what problems you're facing, what problems we're facing, and how each of us is solving them, has been really great. Shout out to some of the folks in the community, like Manfred, who I met recently in Seattle and who signed a copy of Trino: The Definitive Guide for me, which is awesome. And Brian Olsen, who wrote the initial "Trino on Ice" blog series that I took a lot of inspiration from and that welcomed me into this world of using both technologies together.
But to summarize, it's been really fun collaborating with both the contributors and the people working at different companies.
[00:42:54] Tobias Macey:
And for people who are working in the space of dealing with data, particularly if they're already relying on S3, or maybe just considering whether to go that direction, what are the cases where you see Trino and/or Iceberg as the wrong choice, or the lakehouse as an architectural component as maybe the wrong choice for a given use case?
[00:43:18] Kevin Liu:
Yeah. The way I see the lakehouse, this whole architecture, and specifically an implementation like Trino and Iceberg, is that this ecosystem and these technologies are kind of a reimplementation of what a database is. If you look at Postgres, take it apart, and ask what all the components are, you can map them roughly one to one onto the current lakehouse stack. It just breaks apart the database idea. But if a database is all you need, if you can run your entire stack on a Postgres, on an Aurora, on whatever, and you don't have any bottlenecks with that, great; that's all you should do. If you run into issues where running analytics slows down your operations, your inserts, because you're sharing the same database for both analytical and operational workloads, maybe then you can start looking into this realm of the lakehouse.
And especially if you're already on blob storage, it's just so easy to read from blob storage using Trino and Iceberg that it's honestly the best way to analyze data in that realm. Specifically, if you're on S3 and you're using Hive, go look at your S3 bill for LIST requests and how much money it's costing you to do LISTs, then give Iceberg a try and see how much of that bill you save from that one operation alone. From there, you can move into more and more of the feature set within this ecosystem. If you're facing problems with file caching, Trino has a caching feature built on Alluxio.
If you're facing issues with, say, a table format, where you have a bunch of Parquet files you want to analyze and put into a table format, you can use Trino to put them into Iceberg, into Hudi, into Delta. Trino supports all three formats. You're able to read whatever data you want and then write to those formats. So there are a lot of different use cases and a lot of different tools involved, but these technologies give you the tools to figure out exactly what you need, exactly what your bottleneck is, and to solve for those.
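[Editor's note: the S3 bill point is concrete. With Hive-style tables, planning a query typically issues a LIST per partition directory, while Iceberg tracks every file path in manifests, so planning costs a handful of GETs regardless of partition count. A back-of-the-envelope comparison, using placeholder request prices that you should replace with your own S3 pricing:]

```python
# Back-of-the-envelope comparison of object-store request costs for
# query planning. Prices are illustrative placeholders, NOT current
# S3 pricing; plug in your own numbers.

LIST_PRICE = 5.00 / 1_000_000   # assumed $ per LIST request
GET_PRICE = 0.40 / 1_000_000    # assumed $ per GET request

def hive_planning_cost(partitions: int, queries: int) -> float:
    # Hive-style planning: one LIST per partition directory per query.
    return partitions * queries * LIST_PRICE

def iceberg_planning_cost(manifest_files: int, queries: int) -> float:
    # Iceberg-style planning: read a few manifest files via GET per
    # query, independent of how many partitions the table has.
    return manifest_files * queries * GET_PRICE
```

For a table with tens of thousands of partitions queried thousands of times a day, the LIST term dominates, which is the line item Kevin suggests checking.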
[00:46:11] Tobias Macey:
As you continue to rely on and invest in the Trino and Iceberg infrastructure at Stripe, what are some of the capabilities or integrations or projects that you're excited to dig into?
[00:46:24] Kevin Liu:
Yeah. For me personally, one is using Trino on the write path. I think there's exploration there, and we've seen companies like Lyft and Pinterest already use that for some of their use cases, especially with this thing called Project Tardigrade, where Trino is now resilient to partial query failure. Before Tardigrade, when something went wrong, the entire query failed and you just had to rerun it. That was by design, because Trino was designed to be very fast, in memory. With Project Tardigrade, there is checkpointing, so even if something fails, your query will continue running. It almost looks like a Spark engine, where you say, well, I'll just run some query and wait for the result to come back to me.
Because of that, I think there are some use cases we can explore for writing; there are certain use cases that are probably better for Trino to write than Spark. So that's one of them. Another one I'm interested in exploring is more on the table format and data lake side of things, for data sharing: how do we efficiently share data across platforms, across engines, across clouds, so that I can take one piece of data and make it available everywhere? That's more on the Iceberg side and the REST catalog side. But Trino, as an engine that supports every major cloud storage and every major table format, can be helpful in this equation, because it's just a really powerful tool for whatever you want to do in data land.
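[Editor's note: Project Tardigrade became Trino's fault-tolerant execution mode, where intermediate results are spooled so that a failed task can be retried instead of failing the whole query. A toy model of the difference, with invented task names and logic, not Trino code:]

```python
# Toy model of task-level retry, the idea behind Trino's
# fault-tolerant execution mode: instead of failing the whole query
# when one task dies, only the failed task is rerun (its inputs
# having been checkpointed/spooled). Invented simulation, not Trino.

def run_with_task_retry(tasks, attempt, max_retries=3):
    """Run each task; retry only the failed task, not the whole query.

    `attempt(task, retry)` returns True on success. Without retry
    semantics, any single False would abort the entire run.
    """
    completed = []
    for task in tasks:
        for retry in range(max_retries + 1):
            if attempt(task, retry):
                completed.append(task)
                break
        else:
            raise RuntimeError(f"task {task} failed after {max_retries} retries")
    return completed
```

This is also why Trino in this mode "almost looks like a Spark engine": long-running writes can survive individual worker failures rather than restarting from scratch.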
[00:48:26] Tobias Macey:
Are there any aspects of Trino, Iceberg, and your experience of working with them that we didn't discuss yet that you'd like to cover before we close out the show?
[00:48:36] Kevin Liu:
I think everyone should be on the REST catalog, and that there's almost, I would say, almost no reason for you not to be, though that's kind of a bold statement. There are a lot of interesting features where, once you're on the REST catalog, you can start adding more and more of them, and it gives you, the platform engineer, the flexibility to keep adding. And the REST catalog is compatible with whatever catalog you currently have. If you have Glue, if you have Hive, if you have JDBC, Postgres, whatever, you're able to connect the REST catalog to that, use the REST catalog as the interface, and plug that into Spark, into Trino, into Athena, Starburst; whatever engine or vendor you have supports this.
So now this is the centralized piece for everything, and once you're on it, you're able to move wherever you want; you just have that flexibility at the platform level. But there's still a lot of active development in this area, in open source as well. So if anyone's interested, contact me, contact the open source community. I think there's a lot of innovation in this space.
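[Editor's note: for concreteness, pointing an engine at a REST catalog is typically a small piece of configuration. A sketch of a Trino catalog properties file using the Iceberg connector's `rest` catalog type; the property names follow the Trino Iceberg connector documentation, but the catalog name and URI are hypothetical, and you should verify the properties against the current docs:]

```properties
# etc/catalog/lakehouse.properties -- hypothetical catalog name
connector.name=iceberg
# Use an Iceberg REST catalog instead of talking to Hive/Glue/JDBC directly
iceberg.catalog.type=rest
# Hypothetical endpoint; point this at your REST catalog service
iceberg.rest-catalog.uri=https://rest-catalog.example.com
```

The same REST endpoint can then be configured in Spark, PyIceberg, or a vendor engine, which is the "centralized piece" being described.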
[00:50:03] Tobias Macey:
Absolutely. Well, for anybody who wants to get in touch with you, follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:20] Kevin Liu:
Yeah. My perspective has always been reimplementing what we had in the database world into this new cloud, disaggregated, separation-of-compute-and-storage architecture, where you have specialized technology for everything that was previously inside the database. So the REST catalog becomes the data catalog; compute is everywhere, and if you can bring your data to whatever compute you like, you can use whatever compute you like. Then you can optimize your data and your file format into a table format. With Iceberg, a lot of these interesting features are implemented on top of the table format: indexes, security, encryption.
Once you're in this realm, the possibilities are endless, and there's a lot of innovation right now in this world, so it gets me really excited to work in it. And on the Trino side, they're always implementing new things and coming up with new features; the engine just gets much, much better throughout the years. So whatever technology we use, Trino itself, as a piece that can plug into other ecosystems, will always be useful as a tool at the end of the day.
[00:51:56] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your experiences working with Trino and Iceberg, and some of the interesting applications and combinations that you've been able to build with them. It's definitely great to hear from people who are digging deep into these systems and understanding some of the new and interesting ways that they can be applied. I appreciate your investment in making the REST catalog the preferred means of interacting with Iceberg; I agree with you on that. So thank you again for taking the time, and I hope you enjoy the rest of your day.
[00:52:31] Kevin Liu:
Of course. Thanks for having me.
[00:52:38] Tobias Macey:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview
Guest Introduction: Kevin Liu from Stripe
Getting Started with Data Engineering
Stripe's Use of Trino and Iceberg
Challenges and Bottlenecks at Stripe
Infrastructure and Migration Strategies
REST Catalog Implementation
Use Cases and Analytics
Managing Multi-Tool Ecosystems
Unexpected Use Cases and Performance
Lessons Learned and Community Support
When Trino and Iceberg Are Not the Right Choice
Future Projects and Integrations
Final Thoughts and Recommendations