Summary
In this episode of the Data Engineering Podcast Mai-Lan Tomsen Bukovec, Vice President of Technology at AWS, talks about the evolution of Amazon S3 and its profound impact on data architecture. From her work on compute systems to leading the development and operations of S3, Mai-Lan shares insights on how S3 has become a foundational element in modern data systems, enabling scalable and cost-effective data lakes since its launch alongside Hadoop in 2006. She discusses the architectural patterns enabled by S3, the importance of metadata in data management, and how S3's evolution has been driven by customer needs, leading to innovations like strong consistency and S3 Tables.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like — so you can stop fighting over column names, and start trusting your data again. Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: Increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th.
- Your host is Tobias Macey and today I'm interviewing Mai-Lan Tomsen Bukovec about the evolutions of S3 and how it has transformed data architecture
- Introduction
- How did you get involved in the area of data management?
- Most everyone listening knows what S3 is, but can you start by giving a quick summary of what roles it plays in the data ecosystem?
- What are the major generational epochs in S3, with a particular focus on analytical/ML data systems?
- The first major driver of analytical usage for S3 was the Hadoop ecosystem. What are the other elements of the data ecosystem that helped shape the product direction of S3?
- Data storage and retrieval have been core primitives in computing since its inception. What are the characteristics of S3 and all of its copycats that led to such a difference in architectural patterns vs. other shared data technologies? (e.g. NFS, Gluster, Ceph, Samba, etc.)
- How does the unified pool of storage that is exemplified by S3 help to blur the boundaries between application data, analytical data, and ML/AI data?
- What are some of the default patterns for storage and retrieval across those three buckets that can lead to anti-patterns which add friction when trying to unify those use cases?
- The age of AI is leading to a massive potential for unlocking unstructured data, for which S3 has been a massive dumping ground over the years. How is that changing the ways that your customers think about the value of the assets that they have been hoarding for so long?
- What new architectural patterns is that generating?
- What are the most interesting, innovative, or unexpected ways that you have seen S3 used for analytical/ML/AI applications?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on S3?
- When is S3 the wrong choice?
- What do you have planned for the future of S3?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- AWS S3
- Kinesis
- Kafka
- SQS
- EMR
- Drupal
- Wordpress
- Netflix Blog on S3 as a Source of Truth
- Hadoop
- MapReduce
- NASA JPL
- FINRA == Financial Industry Regulatory Authority
- S3 Object Versioning
- S3 Cross Region
- S3 Tables
- Iceberg
- Parquet
- AWS KMS
- Iceberg REST
- DuckDB
- NFS == Network File System
- Samba
- GlusterFS
- Ceph
- MinIO
- S3 Metadata
- Photoshop Generative Fill
- Adobe Firefly
- Turbotax AI Assistant
- AWS Access Analyzer
- Data Products
- S3 Access Point
- AWS Nova Models
- LexisNexis Protege
- S3 Intelligent Tiering
- S3 Principal Engineering Tenets
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of undiagnosed data quality syndrome, also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks and the fastest in the category, analyzing 1.1 billion rows in just sixty-four seconds.
And with collaborative data contracts, engineers and business can finally agree on what done looks like so you can stop fighting over column names and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1,000-plus custom mechanical keyboard.
Visit dataengineeringpodcast.com/soda to sign up and follow Soda's launch week, which starts on June 9. Your host is Tobias Macey, and today I'm interviewing Mai-Lan Tomsen Bukovec about the evolutions of S3 and how it has transformed data architecture. So, Mai-Lan, for anybody who isn't familiar with your work, can you start by introducing yourself?
[00:02:15] Mai-Lan Tomsen Bukovec:
Yeah. Absolutely. My name is Mai-Lan Tomsen Bukovec, and I am a vice president of technology here at AWS. I run AWS services that are basically the data stack. So everything from the bottom of the stack, the bottom turtle, if you will, which is, you know, Amazon S3 and our file systems, through to the streaming and messaging services like Kinesis, Managed Kafka, SQS, SNS, and then the analytics capabilities of Athena, Redshift, EMR, and basically the whole top to bottom of the data stack. And I am delighted to be here. Tobias, I've been listening to your podcast for many years, and I'm excited to talk about how S3 fits into data systems.
[00:02:59] Tobias Macey:
Absolutely. And do you remember how you first got started working in data?
[00:03:03] Mai-Lan Tomsen Bukovec:
Well, to be honest, I came into data management and data storage by way of S3. And, you know, I've been working on it at AWS since 2010, but I was mostly a compute person. I used to run the web server engineering for IIS at Microsoft. And, you know, the way I came to AWS is that I used to spend a lot of time working on making PHP run on the Microsoft web server. And so I had to run all these PHP open source applications like Drupal and WordPress on Windows Server, and I had to go and give demos and show customers and show folks at conferences how to do this. And so, you know, this is back in the day, okay, Tobias? I had to ship servers to do my demos.
And so way back in 2008, late 2008, I started to use EC2. And I did it because I didn't wanna ship hardware to conferences and to customer sites, and it just kinda made me realize that this was gonna be the wave of the future. And so, you know, I reached out to some folks that I knew in AWS in 2010, and I joined, you know, again, on the compute side. I was actually hired by the same guy that I work for today. And I joined to run a set of services like EC2 Auto Scaling and Simple Workflow, and that gave me a crash course in distributed systems. As you might imagine, I was coming from client server architectures.
And it was early in 2013 that I joined S3 as a general manager, and that meant that I was running the S3 service from all the development and operations, we operate in a DevOps model, to product. And that is what brought me to data management and data systems, because it was pretty clear in 2013 that everybody was using S3 not just for backup, not just for storage of website assets, but they were using it for Hadoop based systems and analytics and things like that. Now, luckily, we have tons of amazing customers like Netflix who have data leaders like Eva Tse, who sat me down in 2013, and they gave me a crash course in what does it mean to be the storage substrate for, you know, essentially everything. And, you know, we weren't using the term data lake in 2013, but, you know, that's what leaders like Netflix were actually doing. They were using S3 as what they call the source of truth for all of their systems. And whether that was Hadoop based systems or it was streaming or what have you, that is how S3 has been used for years and years, and that is how I learned about data.
[00:05:49] Tobias Macey:
Yeah. S3 is definitely, as you said, one of those core services that the Internet as we know it today probably wouldn't be able to operate without, at least not as efficiently or as effectively. And it is such a generalized system with a, I don't wanna say simplistic, but very simple, from a conceptual perspective, set of capabilities, which makes it applicable to a broad range of use cases. Many of those use cases do exist within the data ecosystem. And with a focus on that, I'm wondering if you can just give a bit of a summary about some of the roles that you see it playing in the kind of data management, data engineering use case, maybe leaving to the side some of the more application oriented roles that it plays.
[00:06:40] Mai-Lan Tomsen Bukovec:
Yeah. The story of S3 is, I think, a fascinating story, and it's really one that is told by customers who have used S3, you know, basically to write their own story of data. Okay? And so, like, if I think about when S3 launched, you know, launched over eighteen years ago, like, our whole goal was to focus on how can we give elastic storage? How can we provide a storage solution that has all the traits that you want for durability and security, and, you know, basically an economic model, a cost point, where if you have rapidly growing datasets, you can store them and use them without really thinking about scale. Okay?
And what happened is application developers, as you say, kinda jumped on this pretty quickly, because they started to say, okay, how can I use this capability as providing us a shared dataset that is operated on by different compute applications? And so application owners really were the ones that, you know, adopted S3 super rapidly in those first few years after we launched in 2006. But what started off as being a place to store images and web page assets and application storage has really expanded. And if you take a look at that, you can see that, like, the journey of storage has really evolved into storage for any type of data. And that could be operational data that's stored in Parquet format. It can be PDFs if you're using it for RAG. It could be log files that your application generates.
It can be really anything that you think of. And so with the growth of the storage, so too have grown, you know, the data processes around that storage. And if you think about that, ten years ago, we had just under a hundred S3 customers that were storing more than a petabyte of data. But now we have, like, thousands of customers that are storing that much data. In fact, if you look at the storage in S3, we have over 400 trillion objects. We have exabytes of data, and we're averaging 150 million requests per second. And, you know, if you think about that scale, if you think about, for example, that we process over a quadrillion requests every single year, the volume of that data means that it is essentially being used as multipurpose storage. Okay? And when it's being used as multipurpose storage, we have to think about a few things when we're thinking about, like, how do we build storage for basically all of those different use cases, all of those different, you know, types of personas and customers, whether it's a data engineer, a data analyst, you know, an application developer.
We have to think about, you know, how do we obsess not just about, like, the details of security and scale and reliability, but how do we remove the distracting things that stand between developers and data scientists and creative professionals and their data? And it all just kinda comes down for us to, you know, how do you build the best storage on the planet for any application developer or data engineer, whether you're an expert in it or you're not. You simply wanna use it. And this is super important because, you know, if you look at the growth of data, you know, IDC, they run these studies. And one of them, which is what they call the Global DataSphere, says that the amount of data generated is gonna grow at over 27% per year. If you think about that, Tobias, if you think about, wow, think about all of that growth of the data. Where is it going right now? It's going into cloud storage. And it's being operated on as a shared dataset across many different types of applications.
[00:10:31] Tobias Macey:
To that point of S3 being a shared substrate for all data use cases, and the fact that customers use it in such novel and unexpected ways regardless of whether it's something that you ever intended to support, that has driven a number of generational shifts in terms of the feature set and capabilities. From the early days of I can just put an object in a bucket and I can get it back, and maybe I don't even have read-after-write consistency, so I maybe have to poll to make sure that I can get it back, to where we are now, where we have that read-after-write consistency and many other capabilities besides, given the focus on data use cases from an analytical or ML and AI perspective.
What do you see as some of those major generational epochs in terms of the capabilities that s three has provided and how that has both enabled as well as been driven by some of those different stresses on the ways that customers are applying that fundamental technology?
[00:11:33] Mai-Lan Tomsen Bukovec:
Yeah. Tobias, I gotta say, if you're gonna use the words generational epochs, you're gonna make me feel super old. Okay? So let's just call it the story of S3. And if you go to the story of S3, you gotta go all the way back to 2006. And 2006 is a super interesting year for multiple reasons. It's the year that S3 launched, but it was also the year that Hadoop started as an open source implementation. And MapReduce kinda came into the world. Right? And both of these shifts are super influential because it's the rise of this, what I think of as the modern data pattern, which is where you aggregate vast amounts of your data, and it gets operated on as shared storage.
And what that means is that you're running different clusters of compute. You could be running compute for analytics. You could be running compute for applications. You could be running compute for, you know, inference or pretraining, but you're doing it in parallel against a shared storage set. Okay? And so, you know, Amazon S3 made that all possible in 2006 because we brought in multimodal data. We brought in text and image and backup and video at a scale with the kind of cost point, the low cost economics, that I talked about, but also what we think of as S3 traits. Right? And the S3 traits of durability and security and availability and all of that. Okay? So if you're going back to 2006, if you think about, you know, the first, I would say, six years of S3, between 2006 and 2012, you had frontier thinking companies like Netflix and Pinterest and Lyft and, you know, a lot of the startups in that time frame. They were the ones that jumped on this super fast. They were really, really quick to see that if you could combine, you know, those traits of S3, and you could combine open source, you combine Hadoop and MapReduce and these new patterns that were starting to emerge on these shared datasets.
You could build a new data pattern and you could do things at scale you could never do before. And, Tobias, I'm gonna send you a link, if you wanna post it for your readers, where Netflix talked about how, and this is way back in January 2013, they used S3 as their source of truth for all of their data processes in the company. And this was way back in January 2013, which means that they had adopted it in the years beforehand. And they were some of the first that started to adopt this aggregation modern data pattern. And, you know, that was not just unique to Netflix. You had other companies like Pinterest, which was founded in 2010, and they were doing the same thing. And so, you know, I have been working on S3 for, you know, over twelve years now. And I actually remember when we were, you know, looking at the metrics of usage and we were looking at the metrics of IO, and we started to see these super highly concurrent requests coming into a shared dataset. And it was pretty different from the traffic patterns of website assets and backups. And so as you mentioned, we had to make a fair number of changes under the hood. Okay? And, yeah, you talked a little bit about consistency. S3 was launched as an eventually consistent system.
And we moved to a strongly consistent system because of what people were doing rapidly in those first few years of S3, moving to using MapReduce and doing analytics and doing those concurrent requests. And, you know, it's not just the consistency model that we had to change. We had to do automatic key partitioning deep within our index system. We had to introduce new storage capabilities like intelligent tiering that had dynamic pricing built in. And we had to do this not just because of these born in the cloud companies like Pinterest, and how Netflix reinvented itself. We started to see in about 2013 the first, you know, wave of enterprise customers really start to do the same thing. And it was NASA JPL, which was one of the first customers that started to adopt the same pattern. It was FINRA, the regulatory body of the stock exchanges.
And so, you know, around, I would say, 2012 to 2016, we started to see this wave of enterprises start to adopt the same patterns that Netflix and Pinterest and Lyft had started to do. And so, you know, if you think about this time between 2006 and 2019, that was really when we started to see this pattern of data aggregation on shared datasets become super common. And, you know, people call that the data lake. Right now, we have over a million of these data lakes running on S3, but the adoption of the data aggregation pattern was really customer driven. And it was customer driven because it was the combination of the rise of open source analytics with the growth of storage, and customers just started to do that. And so, like, in the aggregation model, you have producers of data, and they're generating the data that's stored in S3, and it's everything from, you know, applications that are generating log files to sensors that are sending in data to ETL workflows. And then you have consumers.
And the consumers are operating on that shared aggregated dataset, like the scientists or the app builders or the data engineers. And these two concepts of production and consumption are basically disaggregated, much like how the shared dataset model disaggregates compute and storage. And that is the pattern that has really grown up over the last set of years, and it is the one that is being adopted at scale. And it's being adopted by customers in every single industry, and it has been the pattern that's driven everything from us adding object versioning in 2008, to, you know, cross region storage replication in 2015, to what we talked about, what you brought up, which is the introduction of strong consistency in S3.
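For listeners who never ran on the old model: S3 was eventually consistent until the strong consistency launch in December 2020, so a GET immediately after a first-time PUT could briefly miss, and client code defensively polled. A minimal sketch of that old pattern with boto3 (the bucket and key names are hypothetical); under today's strong consistency, the first read after a successful write succeeds:

```python
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET, KEY = "example-bucket", "example-key"  # hypothetical names

s3.put_object(Bucket=BUCKET, Key=KEY, Body=b"hello")

# Under the pre-2020 eventual consistency model, clients polled until a
# newly written object became visible. With strong consistency, the
# first GET after a successful PUT is guaranteed to see the object.
for attempt in range(10):
    try:
        body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
        break
    except ClientError as err:
        if err.response["Error"]["Code"] != "NoSuchKey":
            raise
        time.sleep(0.1 * 2 ** attempt)  # exponential backoff
```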
[00:17:47] Tobias Macey:
And most recently, S3 Tables for people who are investing in Iceberg as their core relational format.
[00:17:55] Mai-Lan Tomsen Bukovec:
Yeah. I mean, you know, if you think about the evolution here, Tobias, you know, Iceberg is just fascinating. I mean, if you think about Iceberg, Iceberg was a project that was started by developers from Netflix and Apple in 2017, and these are developers that grew up on S3. They grew up on S3 being the source of truth. And so when they looked at how they were using S3, they were just like, okay, how do we kind of reimagine the whole Hive concept? Right? And, you know, they contributed Iceberg as an OTF into Apache in, I think it was 2018.
And by 2020, it was a top level project. And if you think about, you know, you connect the dots there, Tobias, with the growth of, for example, Parquet data. So S3 today stores exabytes of Parquet data. And as you know, Parquet is a very compressible format. So exabytes is a lot of Parquet data. And today, we average over 15 million requests per second just to the Parquet data type. It is actually one of my fastest growing data types in S3. And so if you think about Iceberg as something that has really emerged as a very, very popular way of interacting with structured data in S3, yeah, we looked at that and we said, okay, how can we make that better? And, yeah, it was in February, we launched S3 Tables, which is basically a new bucket type, but it's built in, S3 native support for Iceberg table access to your Parquet data that you wanna store in your bucket. And so, you know, S3 Tables makes us the first object store with this built in OTF support for Iceberg, and it's super, super popular. You know, we're making tons of changes to it. We just added support for KMS.
We now support the Iceberg REST endpoint. We have integration for S3 Tables with AWS analytics services, but we also have integration with Snowflake and DuckDB. And we think this is gonna be, you know, a major step forward for a lot of folks who wanna interact with their Parquet data in S3.
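To make that concrete, here's a rough sketch of touching S3 Tables from code: create a table bucket with boto3, then read a table through the Iceberg REST endpoint with PyIceberg. The bucket, namespace, and table names are hypothetical, and the endpoint URI and catalog properties should be checked against the current AWS and PyIceberg docs:

```python
import boto3
from pyiceberg.catalog import load_catalog

# Create a table bucket -- the new bucket type for S3 Tables.
s3tables = boto3.client("s3tables", region_name="us-east-1")
bucket = s3tables.create_table_bucket(name="analytics-tables")  # hypothetical name

# Point an Iceberg REST client at the S3 Tables REST endpoint.
# The URI and SigV4 properties below are illustrative; verify the exact
# values for your region and account in the AWS documentation.
catalog = load_catalog(
    "s3tables",
    **{
        "type": "rest",
        "uri": "https://s3tables.us-east-1.amazonaws.com/iceberg",
        "warehouse": bucket["arn"],
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": "us-east-1",
    },
)

# Read a (hypothetical) table as Arrow via a normal Iceberg scan.
table = catalog.load_table("my_namespace.my_table")
print(table.scan().to_arrow())
```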
[00:20:03] Tobias Macey:
And going back to your point of this being a shared pool of data, a single source of truth, something that can easily be aggregated across different use cases, there have been shared data solutions for decades, far predating the cloud, thinking in terms of things like NFS and Samba at the protocol layer and then implementations like Gluster and Ceph that allow you to have this shared pool of data. And I'm wondering what you see as the fundamental differences between those protocols and technical implementations that allow for that sharing, and also shared network drives, versus what S3 and all of its copycats have enabled in terms of being able to build these massively scalable, massive throughput data systems?
[00:20:52] Mai-Lan Tomsen Bukovec:
Well, you know, I gotta say, Tobias, I really do think that there's nothing quite like S3. Right? And, you know, I mean, I have worked on it for a while now. And I know there are other approaches on premises, and there's plenty of applications out there that talk about having S3 compatible APIs. But it's what's under the hood that matters. And, you know, if you just take some of those storage protocols that you mentioned, they had to kind of decide they were done at some point. And they've been very slow to move and evolve. And some of the things that we just talked about, up to and including S3 Tables, it kind of, you know, points to the rate of evolution and the degree of change that S3 has gone through over the last eighteen years. And, you know, I mean, don't get me wrong. Like, in S3, the team is totally fixated on not breaking existing customer workloads, but we've also been able to quickly learn and adapt and improve at just about every single layer of S3.
And, you know, I gave you some scale numbers, you know, the 400 trillion objects, exabytes of data. But we also, you know, daily send over 200 billion event notifications. Okay? And in total at peak, we're serving over a petabyte per second of traffic worldwide through our front end. One petabyte per second, and we are doing over 10 billion checksum computations per second. And so if you think about scale, it's kinda like what's under the hood of S3 that makes it so unique. So, as I've mentioned in our conversation, S3 customers write the story of data with S3. And I'll just take one example. Okay? So FINRA is the regulatory body of the US stock exchanges, and they're a big customer of AWS and S3. And so what FINRA does is they regulate member broker dealers in the security markets, and they're responsible for collecting, processing, and analyzing every single transaction across all of our US based equity and option markets to look for improper behavior, to start proceedings against examples of fraud within twenty four hours of it happening. And so if you think about that whole mission, think about it from a data perspective. They have to ingest hundreds of billions of records every single day. They have had peak days where they have processed over 906 billion records, multiple days in a row. And on average, if you think about that from a data perspective, it kinda comes down to 30 terabytes of data they have to process in a single day. And they have this super strict four hour SLA for processing their data. And their workload is very bursty, because they're accessing their data very, very hard for a short period of time, this four hour period I talked about. And then they're idle for another twenty hours a day. Not purely idle. They're ingesting new records, but they're doing all of their bulk processing within a four hour window.
So they have twenty hour windows of relatively light activity. Okay. So if you take, you know, a data workload like that and you think about a storage system that has to support it, what is under the hood? That's what matters. And S3 is very, very unique because of the way that we approach scale. Alright. So let's kinda talk about a few numbers. And, Tobias, you are going to have to pull me out of the weeds if you think I'm in there too much. Okay?
[00:24:14] Tobias Macey:
Wouldn't dream of it. Okay.
[00:24:16] Mai-Lan Tomsen Bukovec:
Okay. Thanks, Tobias. Alright. So let's just take, you know, a customer. It can be FINRA or whatever. But let's just take a customer that wants to hold a petabyte of data, and they wanna process all that data in a one hour period. K? Bursty customer. So if I were to access a petabyte over a one hour period, the access rate, the intended access rate, is going to be 275 gigabytes per second across that data. Right? And if it's about a megabyte per object, that's about a billion objects. So if you think about that, take a step back, it's a pretty good workload size. And if you go back to the numbers and tie it together, if you say I have a petabyte of data that I wanna access at 275 gigabytes per second of peak, if I were just storing that data and not accessing it, I would need, under the hood, 50 drives if there were about 20 terabytes per drive, which is roughly the drive size that we're landing nowadays in S3. But if I were gonna access 275 gigabytes per second at 50 megabytes per second per drive, I would need over 5,000 drives in my system to just support that one workload. So there's a hundred x amplification.
If you're thinking about drive count, if you think about the access rates over the storage rates, those drives end up being idle, you know, twenty three hours of the day in this particular example, because I'm only using that IO capacity for an hour a day. So if you take a step back and you think about, okay, you know, what does that mean? That means there's a huge amount of inefficiency if you were trying to do that yourself. You're paying for a hundred x the capacity, but you're not using it for 95 plus percent of the time. So what would this workload look like? And if S3 has millions of active customers, isn't this a problem that would be, like, massively worse? A million times worse for S3 as a service? But this is where it comes down to what is under the hood of S3. The really cool part of what you get, and is very differentiated, is how S3 works at scale. Again, there is no compression algorithm for experience, and S3 has been doing this for over eighteen years.
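Her numbers check out. Here's the same back-of-the-envelope arithmetic as a script, using the figures from the conversation (roughly 20 terabyte drives and about 50 megabytes per second of sustained throughput per drive):

```python
# Back-of-the-envelope math for the bursty petabyte workload described above.
PB = 10 ** 15

dataset_bytes = 1 * PB
burst_window_s = 3600             # all data read in a one hour burst
object_size = 1_000_000           # ~1 MB per object
drive_capacity = 20 * 10 ** 12    # ~20 TB drives
drive_throughput = 50 * 10 ** 6   # ~50 MB/s sustained per drive

objects = dataset_bytes / object_size                    # ~1 billion objects
access_rate = dataset_bytes / burst_window_s             # ~278 GB/s peak
drives_for_capacity = dataset_bytes / drive_capacity     # 50 drives
drives_for_throughput = access_rate / drive_throughput   # ~5,500 drives

print(f"{objects:.1e} objects, {access_rate / 1e9:.0f} GB/s peak")
print(f"{drives_for_capacity:.0f} drives to hold the data, "
      f"{drives_for_throughput:.0f} drives to serve the burst "
      f"({drives_for_throughput / drives_for_capacity:.0f}x amplification)")
```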
And what we do is we have certain patterns that we think about. Okay? Individual workloads are really bursty, but if we make sure that independent workloads have nothing to do with each other, we can actually manage this at scale. Okay? So, yeah, younger data tends to be hot. Smaller objects tend to be hot. We have lots of young and small objects. We have over 400 trillion objects. But you know what? We also have a lot of older and very, very large objects. And when you layer those two things together, those super hot, super small objects for, let's say, analytics, suddenly the system, you know, layered with these really large cold objects, becomes a lot more predictable. And as we aggregate them, even though our peak demand progressively grows in S3, the peak to mean is collapsing, and the workloads become more predictable. And that is really awesome if you think about the physics of how hard drives work, which we do. We're actually fairly obsessed with them. Because we can spread thousands and millions of customers across many more drives than if they ran their storage alone. And we can do that because we can overlap the peaks of some workloads against the quiet periods of others.
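The effect she's describing, peaks of independent workloads landing in each other's quiet periods, shows up even in a toy simulation. This is purely illustrative, not a model of S3: each simulated customer is idle except for one random busy hour a day, loosely like the burst-then-quiet FINRA pattern above, and aggregating many of them collapses the peak-to-mean ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
hours, customers = 24, 10_000

# Each toy customer drives load in exactly one random hour of the day.
demand = np.zeros((customers, hours))
demand[np.arange(customers), rng.integers(0, hours, customers)] = 1.0

single = demand[0]
aggregate = demand.sum(axis=0)

print("one customer   peak/mean:", single.max() / single.mean())       # 24.0
print("10k customers  peak/mean:", aggregate.max() / aggregate.mean())  # ~1.1
```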
And that multiplexing is really what differentiates S3 from everything else out there. It's how we make scale work to our advantage. We have tens of thousands of customers, probably some of whom are listening in today, who have data that's spread out across a million drives each, even if their storage footprint does not require that much. Which means if you're able to spread across a vast number of spindles of these spinning hard drives, you're able to burst to the throughput offered in aggregate by those spindles, even if the space that you're taking up for storage doesn't really fill up a drive. And because any single customer is a very small portion of a given disk in S3, we have this, like, really nice workload isolation where we decorrelate different customers across their peaks and valleys. Okay? So, I mean, like, honestly, Tobias, I could go on and on. Sorry to geek out on this. It's kind of the essence of S3 and why it's so unique. But, you know, this is the team that geeks out on every last detail of that massive scale, right down to how we, I'm not kidding you, wheel in a new rack to a data center. Okay? So some of the rack types we use, they weigh upwards of 4,000, 5,000 pounds. That's over two tons. And so if you think about a rack that weighs thousands of pounds, they weigh more than a car.
They are so big that when we have to consider, you know, a data center and the design of our data centers for AWS, we consider the weight of the S3 rack, because when we have to land them in a loading dock, they have to be able to be wheeled into their final location without literally collapsing the floor along the way. And so in our AWS data centers, we have reinforced flooring to support the weight of moving our S3 data racks around. And when you think about where you put your storage, and you think about all the engineering right down to the floor of the AWS data center, that's what you get when you put your bytes in S3.
[00:29:44] Tobias Macey:
And in terms of the architectural patterns that are enabled because of the massive throughput, the reliability, the other features that are baked into S3 as a core primitive, there's also the fact of being able to use it as a single source of truth across the boundaries of application data, so log files, as you mentioned, user uploads of different images, and various other unstructured object storage, as well as analytical data, going back to the Iceberg use case, and the increasing growth of ML and AI use cases on top of that. What are some of the ways that that shared pool of data and the unified access pattern that it offers change the ways that teams have approached the architectural primitives of the systems that they're building, given that they do have S3 as that core backing store, without, for better or worse, having to worry so much about some of the data patterns that go into it?
[00:30:50] Mai-Lan Tomsen Bukovec:
Well, one of the reasons why I think that data aggregation took off is because when you have an aggregated set of storage, you have a federated data ownership model, and a lot of companies really like that. Right? And, you know, I mean, you can talk about these companies that were born in the cloud, like Pinterest. You know, David Chaiken, who's one of their engineering leaders, says that they swim in data. It's like, you know, it's like the elements. It's water. It's air to them. And when you have a capability to swim in data, the consumers of that data have a lot of control over how they want to use the data. Okay? And because so much integrates with S3, you can have the consumers of the data decide what they wanna use. They can use an AWS service to process the data through, for example, you know, our managed Kafka service. They can use DuckDB. They can use, you know, basically this whole rich ecosystem of tooling to interact with the data and do something with it. And if I think about, you know, the patterns that I've seen emerge, I mean, you know, I think Parquet has definitely emerged as a default for a lot of different companies for how they wanna interact with their structured data. It's why we introduced S3 Tables. But I would say, Tobias, that one of the things that I find super interesting is, in the last five years, you know, metadata has really started to take a central role in how either data practitioners or application developers interact with their data. And, I mean, this has been true for a long time. If you go back to Netflix, Netflix has always talked about how critical metadata and metadata systems have been for their data practitioners.
It is actually one of the reasons why we launched S3 Metadata in an S3 table, so you can navigate your metadata through an Iceberg compatible client. But, you know, I think, Tobias, that metadata is gonna be the data lake of the future. You know, as customers add more and more metadata around data lineage and data processing and data governance and all the things that, you know, our data practitioner listeners know are super critical about not just finding your data, but also being able to use it and interact with it over time, I think that's where you're gonna see these systems really evolve. And, you know, for us, putting metadata at the S3 storage layer as a native construct is really important, because we know that customers are gonna wanna have those same traits of storage. They're gonna wanna have durability. They're gonna wanna have, you know, availability. They're gonna wanna have that S3 kind of openness to interacting with the data while still having the security and the ability to protect it and use it. And so, you know, if you think about that, let's just take, for example, you know, in late 2022, you had the rise of these generative AI models come out. And if you think about some of the fastest companies to put their AI based solutions into production, a lot of those companies were using S3 as a backing store. And they had a very good understanding of what data they were going to use for it. And I'll give you an example. In 2023, Photoshop introduced Generative Fill, and they introduced it pretty quickly based on the capabilities in Firefly. And in Adobe Firefly, they, you know, create their own models that are powering the Generative Fill application in Photoshop. And they're doing that because they're training on all the stock images, etcetera, that are stored in S3. But they were able to get that capability out really quickly because they were already operating with their data systems in S3. They were already operating with a deep understanding of the metadata that they were managing for understanding, you know, how they were doing their training.
And they were able to get Generative Fill out there, which is super popular, you know, the fastest adopted feature in Photoshop history. And they were able to do it in the first part of 2023. And, you know, I mean, that's just one example of companies that were out there with production experiences. Booking.com is another one. You know, Intuit launched an AI assistant in their properties like TurboTax, so that in the first quarter of 2024, customers were using the TurboTax AI assistant, and they were using it with datasets that were stored in S3. And those datasets ranged from, you know, obviously, the history of your tax returns, to an understanding of the changes of the tax model, which was a dataset that the AI assistant used, to the decades of understanding of best practices that Intuit has built up.
Those three different datasets were going into AI assisted tax filing for customers in the first quarter of 2024. And so, you know, the shared aggregation model, combined with, you know, an understanding of metadata, has really been behind the scenes driving a lot of what customers are doing with the latest, you know, capabilities of, for example, generative AI models.
[00:36:01] Tobias Macey:
To that point of metadata being such a critical element of understanding what your data assets are, how they are being used, and how they can be used, there are also many anti patterns that can grow up through the use of S3 because it doesn't enforce any structure in and of itself. And I'm curious how you've seen companies hamstring themselves as a result of being a little bit more careless in terms of how they employ S3, the various patterns and best practices that they either fail to establish or establish in a manner that blocks them from being able to unlock some of those more advanced ML and AI use cases. And how should teams be thinking about some of those core design principles of how to structure the storage and access controls and access patterns of the, particularly, application generated data assets that they are managing, in order to be able to then consume them for additional use cases beyond just the ways that they were primarily envisioned?
[00:37:05] Mai-Lan Tomsen Bukovec:
It's a great question. You know, we talked a little bit about the growth of overall data a little bit earlier, Tobias. And, you know, we look at some of the patterns that are out there generally. IDC did one survey on this, and they found that enterprise organizations are generating nearly 60% of the world's data, and that's expected to grow to 70% by 2026. And it's driven by everything from the influx of IoT sensor data, to the installed base of video cameras, to AI/ML, to applications.
There's just, you know, I mean, so much data, and it's coming from all of these different sources. And as you say, there's some pretty basic stuff that you have to do from a security perspective. And, you know, S3 is secure by default. When you set up a bucket, the only person who can access it is the person who set up the bucket. But, you know, we found pretty quickly that one of the things that was important was to create this capability, we call it Block Public Access, and we launched it many years ago, where it's a control. It's not a setting. It's a control. And the control, which you can set at a bucket level, an account level, an organizational level, basically says, for anything that you add to this bucket now and in the future, you never have public access to it. Super important for enterprises. And what we've done for S3 is that we've actually made it a default now for buckets. Right? And so we will always block your public access until you go in and, you know, you can change it. We don't recommend you change it, but you can change it if you absolutely must. It's just actually pretty hard, because it's a control that we put in place. And, you know, if you think about that growth of data, and you think about how a lot of it's happening in the enterprises, the enterprise really cares about a data perimeter. That's what they call it. How can I enforce my data perimeter? Controls like Block Public Access are part of that. But we have many other things in AWS that help you with that. We have this capability called AWS Access Analyzer. What Access Analyzer does is use automated reasoning, which is sort of like, you know, if math got married to computer science and they had kids, they would be automated reasoning kids. And automated reasoning as a science powers this whole Access Analyzer, which will go analyze all the rules that you have in place on S3 or EC2 resources, and it'll tell you where a rule isn't actually doing what you think it is. It tests the correctness of the rules.
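Both of those are concrete, scriptable controls. As a minimal sketch, here's asserting Block Public Access on a single bucket with boto3 (the bucket name is hypothetical; account-level and organization-level variants exist as well):

```python
import boto3

s3 = boto3.client("s3")

# Assert the Block Public Access control on one bucket. New buckets get
# this by default; setting it explicitly makes the intent auditable.
s3.put_public_access_block(
    Bucket="example-data-bucket",  # hypothetical name
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```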
And that concept of starting first with security, and having capabilities like Access Analyzer to help you with it, is actually one of the reasons why so many enterprises do use S3: because you can establish a data perimeter. You can put these controls in place. Now that said, what customers often do is they add this layer. We talked about metadata, and how metadata is really, you know, taking off as a data pattern of not just discovery, but understanding the usage of data. And a lot of customers, like BMW Group and, you know, Cox Automotive, they're actually building on top of the aggregation pattern, and they're creating something that I call curation. It's basically data products. Okay? And in the concept of this data product, they're taking subsets of data in the aggregated model, and they're cleaning the data. They're processing their data. They're doing all this stuff. And then they're providing access to those datasets. Again, this is just data in S3. So it's, you know, at the end of the day, it's data in a bucket. Right? Or it's an S3 Access Point or what have you. But they're offering those clean datasets as, quote, unquote, the S3 data to the marketing team, to the application team, to, you know, people who wanna use it. Now you're always gonna have organizations in your business who need access to everything in S3. Right? Like your fraud team or your scientists or your researchers. Right? Because they can do all kinds of things if they have access to structured and unstructured data, and they're super inventive and they can come up with stuff. But the pattern that we've really been seeing in the last two to three years is that on top of this aggregated model, where you can provide access to a subset of your users, right, who need the raw dataset, people are curating the data. And they're providing it through data marts or, you know, sometimes internal portals. And those subsets of data products are then the ones that are driving the business. And they're the ones that the majority of the compliance and the governance is acting on, because those are the datasets that are approved for, you know, enterprise use. And that has been super interesting. That is, again, one of the reasons why we built S3 Metadata, and we're iterating super fast on it, because we want people to be able to run an Iceberg query on their metadata and really understand, you know, how to find objects. And, you know, you can put user metadata into an S3 Metadata table. And that trend, I think, is really gonna pick up, because those same datasets are being used for analytics workloads as they're being used for AI workloads, and customers are going to wanna base their businesses on them.
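As a rough sketch of what that kind of metadata query can look like, here's an Iceberg-style query over an S3 Metadata table submitted through Athena with boto3. The table name, column names, and output location are hypothetical, so check the current S3 Metadata schema docs before relying on them:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical names: "my_metadata_table" stands in for the S3 Metadata
# table for a bucket, surfaced through the Iceberg-compatible catalog.
query = """
    SELECT key, size, storage_class, record_timestamp
    FROM my_metadata_table
    WHERE record_type = 'CREATE'
    ORDER BY record_timestamp DESC
    LIMIT 100
"""

# Submit the query; results land in the (hypothetical) output location.
athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```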
[00:42:17] Tobias Macey:
And on the point of AI workloads, there are so many unstructured data assets that live in S3 that have been accrued over multiple years, starting from the initial big data era of just store everything, and maybe it'll be useful sometime. And I think that that time is becoming now, and I'm wondering how that has added new stressors to S3 as a service, as well as some of the ways that it has unlocked new architectural patterns for the teams that are building some of those solutions on top of the data that lives in S3.
[00:42:52] Mai-Lan Tomsen Bukovec:
If I think about the last few years, Tobias, I mean, it is kinda interesting. I mean, if I think about, you know, Adobe trains their own models. Right? You know, we work pretty closely with Anthropic. Yeah. We have our own models, the Nova models that Amazon produces, as well. And, you know, the source of the pretraining data is often S3. Right? But the inference driven experience is also driven off of datasets in S3. And, you know, like, a recent example of that is, LexisNexis just recently launched a capability they call Protege. And Protege is an AI driven capability which uses AWS cloud services, including S3. And the goal for Protege was to create this AI driven assisted experience that can change how legal teams work. Okay? And that's corporate legal departments, it's law firms, it's law schools. And the generative part of that is using AI to draft transactional documents. Right? Litigation motions, briefs, complaints.
You can upload, from a dataset perspective, for example, an expert witness deposition, and you can create an AI summary. And one of the interesting things about the capabilities of Protege is that it's not just using AI to speed up these legal tasks. It's actually customizing the AI experience based on the personal style of the user that's using it. Okay? And so if you bring that back to a data perspective, when you're using a capability like Protege from LexisNexis, it's taking into account the user style. It's taking into account the firm style. It's taking into account the firm's work requirements, so that all of that custom data is actually gonna customize the AI assisted output.
And, you know, if you think about it under the hood, these are all agentic systems. Right? And the agentic system that's under the hood is basically doing everything from taking those datasets and personalizing the output, but it's also doing things like, their AI system has, for example, a planner agent. Right? And the planner agent is basically taking a legal question and breaking it down into several steps that can then be processed by other agents. Or it has an interactive agent that lets the user modify the agent's plan and choose the best course of action. Or it has a self reflection agent. And the self reflection agent is self evaluating, and it's refining the work of what it's doing to make sure it's drafting a better document.
And so if you think about sort of the patterns of the future, we have a lot of humans working with data right now in S3. And how those humans work with data is they work with metadata, and they work with the underlying data, and they do, you know, Iceberg queries against S3 data, or they, you know, do a semantic search across, you know, unstructured data. But the evolution of where the data is going with S3 is that more and more agents are gonna process the data. It's not just gonna be humans. You know, the humans are always gonna be in the loop. But in these systems like Protege, you have agents that are processing the data, and you have agents that are reflecting upon the quality of the response and how they're using the data, and that is absolutely where companies are going in the future. And whether it's a human interacting with the data or it's an agent interacting with the data, they're all gonna wanna interact with these datasets to either customize the inference or, you know, do the data hygiene, the data processing, of those data products that we talked about. That is a super exciting evolution, and I am always fascinated and inspired by what customers are doing with this vast amount of data that we have in S3.
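None of this is LexisNexis's actual code, but the planner / worker / self-reflection shape she describes reduces to a simple loop. A toy sketch with stubbed-out model calls (every name here is hypothetical; in a real system each agent would be a model call grounded in datasets pulled from S3):

```python
from dataclasses import dataclass, field

# Toy sketch of the planner / worker / self-reflection pattern described
# above. The three "agents" are stubs standing in for model calls.

@dataclass
class Draft:
    text: str
    notes: list[str] = field(default_factory=list)

def planner(question: str) -> list[str]:
    # Break a legal question into steps other agents can process.
    return [f"research: {question}", f"draft summary for: {question}"]

def worker(step: str, draft: Draft) -> Draft:
    draft.text += f"[{step} output]\n"  # stand-in for a generation call
    return draft

def reflect(draft: Draft) -> bool:
    # Self-evaluate the draft; a real reflection agent would critique
    # style, citations, and firm requirements, then trigger a revision.
    draft.notes.append("checked citations and firm style")
    return True  # good enough, stop iterating

draft = Draft(text="")
for step in planner("summarize the expert witness deposition"):
    draft = worker(step, draft)
    if not reflect(draft):
        draft = worker("revise", draft)
print(draft.text, draft.notes)
```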
[00:46:48] Tobias Macey:
And in terms of the most interesting, innovative, or unexpected ways that you have seen the product used, I'm wondering what are some of the ones that stand out to you?
[00:47:00] Mai-Lan Tomsen Bukovec:
Well, I don't know, Tobias. I am a mom, and I have three kids, and I don't have favorites, Tobias. Okay? So I have, like, so many different customers doing so many different things. I think, you know, I've always been super impressed with what Netflix has done with data. When they pivoted to their streaming model, they did it in a way where they just instrumented everything, just everything. And it's like what Pinterest talked about, where they swim in data. Customers like Netflix and Pinterest, they're so agile and they're so fast because they've really aggregated all their data in S3 and they're able to use it in super interesting ways. In fact, if you think about Pinterest, a very, very small percentage of Pinterest data is actually the pins.
The rest of it is the operational and the analytical data that drives everything in their super targeted visual search experience. You know, I think about what FINRA has done over the years. I remember the first conversation I had with the FINRA team back in 2013, and we were talking about a bucket. We were talking about bucket taxonomy. And you look at how they've evolved over time, and it's actually super amazing what they've done, you know, really right down to the economics. I mean, between 2019 and 2021, their data grew, I believe it was threefold in that time frame, but they dropped their unit cost by 50% just by using our storage classes of Glacier Instant Retrieval and Intelligent Tiering.
And so, you know, I think that's amazing. You know, I think what Adobe has done, and how they've really transformed the digital professional's experience with Generative Fill and all their different capabilities recently, I think that's amazing. You know, I was spending time with Min Chen, who's the chief AI officer of LexisNexis, talking to her about how they built Protege, and I think that is gonna change how legal professionals work. And, you know, I mean, honestly, I could go on and on, Tobias, but I am not allowed to pick a favorite.
[00:49:06] Tobias Macey:
And in your own experience of working on such a foundational product for so many people, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:49:18] Mai-Lan Tomsen Bukovec:
I mean, just so many of them. I've been working on these AWS systems for fifteen years now, and on S3 for over twelve. And, you know, I think for us, it kinda goes back to, really, how do you think, what are your mental models, of these really, really large scale systems? And, Tobias, I'll send you a link, or your readers might find this interesting. We have these principal engineering tenets in S3 that our engineers use as a guide. Right? They use it as a guide insofar as it helps them break ties, or pick how they wanna operate, or how they show up in a project. And, you know, I personally think that tenets are the most interesting when they're slightly in conflict. Okay?
And so here are the two that I always think of when I work on S3. One of them says, you know, as a principal engineer, you have to be technically fearless. Okay? And so in the wording of the technically fearless tenet, we say, we acquire expertise as needed. We pioneer new spaces, and we inspire others as to what's possible. Okay. So if you take a step back and you think about a system like S3, and you think about how many people depend on it as a system of record, you know, you have to be technically fearless. You have to be able to say, you know, what we have today is not enough. It's not enough for tomorrow. We have to keep reinventing what we are and how we can provide the best storage on the planet.
And these capabilities that we added, you know, like s three tables and s three metadata, we have so many other capabilities that we've added in the last, you know, two or three years. We've raised the amount of buckets that you can have. You can have a million buckets in your account. Like, there's there's so many things that we've done in s three over the years because we have to be technically fearless. But there's another principal engineering talent, tenant, and that tenant is called respect what came before. Okay? And in the words of that tenant, we say, we appreciate the value of working systems and the lessons they embody. We understand that most problems are not essentially new.
And so that tenant is really interesting for s three because in s three we have over eighteen years of experience building and operating systems And we know that customers don't want to re architect. They they wanna build. They wanna expand and and evolve their existing capabilities and take advantage of the latest without having to, you know, change what they're doing. And it's why, for example, when we introduce strong consistency, we introduce strong consistency at no extra cost, at no performance deficit for every single request that you make to s three because we know that we have to respect what came before. And so if I think about, you know, the the most interesting and, challenging thing that I have learned while working on s three, it's, you know, how do you take those two engineering tenants of being technically fearless and respect what came before and make sure that everything you do on s three, whether it's some of the geeky stuff I talked about under the hood or it's a new capabilities that we talked about, how do you make sure you deliver that in every change you make to s three? And how do you do that in a way that adopts the latest, you know, science? Okay.
And, you know, we talked about the strong consistency thing. I mentioned the automated reasoning. You know, we have built automated reasoning in so many different parts of s three. In fact, when you do a check-in in our index layer, we're checking for to make sure using a proof. We're checking to make sure that you aren't introducing a regression to our consistency model, and we're doing that with with automated reasoning. And so, like, you know, if if you think about the challenges of being technically fearless and respecting what came before, the challenge for us is how we do things like use math to help us. The challenge for us is to make sure that we keep these these s three traits, these principles of durability and availability and security, and make sure they're built into the mental models and really the culture of everything that this team does.
And that's a challenge, and that is a challenge that every single member of the team of Esri, including myself, that is what we wake up and do every day with Esri.
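For listeners curious what a consistency-regression check might even look like, here is a toy, brute-force sketch in Python. It is emphatically not how S3 does it (S3 uses formal automated-reasoning proofs, as described above); it only illustrates the shape of the read-after-write invariant such a proof would guard, using a deliberately buggy cached index for contrast. All names here are hypothetical.

```python
from itertools import product

class CachedIndex:
    """Tiny model of an index fronted by a cache. With write_through=False,
    writes update the backing store but never refresh the cache -- a bug."""

    def __init__(self, write_through: bool):
        self.store = {}
        self.cache = {}
        self.write_through = write_through

    def write(self, key, value):
        self.store[key] = value
        if self.write_through:
            self.cache[key] = value  # keep the cache coherent with the store

    def read(self, key):
        if key in self.cache:
            return self.cache[key]  # may be stale in the buggy model
        value = self.store.get(key)
        self.cache[key] = value
        return value

def find_consistency_violation(make_index, max_ops=4):
    """Enumerate every short trace of writes/reads on one key and return the
    first trace where a read does not observe the latest completed write."""
    ops = [("write", 1), ("write", 2), ("read",)]
    for trace in product(ops, repeat=max_ops):
        index, last_written = make_index(), None
        for op in trace:
            if op[0] == "write":
                index.write("k", op[1])
                last_written = op[1]
            elif index.read("k") != last_written:
                return trace  # read-after-write invariant broken
    return None

# The coherent model survives the search; the buggy one is caught.
assert find_consistency_violation(lambda: CachedIndex(write_through=True)) is None
print(find_consistency_violation(lambda: CachedIndex(write_through=False)))
```

A real proof covers all traces of all lengths, not just short ones; closing exactly that gap is what automated reasoning buys over brute-force search.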
[00:53:49] Tobias Macey:
And for people who are building data systems and designing their data architecture, what are the situations where you would advise against using S3, at least as a primary store?
[00:54:03] Mai-Lan Tomsen Bukovec:
Well, we see S3 used for just about everything, so that's kind of a hard question to answer, Tobias. But if you want sub-millisecond latency, it's not ideal as a direct store. I'll tell you, though: with the next generation of these infrastructure startups and SaaS providers of infrastructure, what we're seeing happen right now is that they're building that sub-millisecond layer on top of S3. They're using S3 as the substrate of data, for the durability and the throughput, but they're caching the data at a higher layer for that super-low-latency access.
Because if you're using S3 directly, you don't have that sub-millisecond latency. But if you think about all of the new infrastructure startups out there now, a lot of them are assuming S3 as the substrate below that cache.
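As a minimal sketch of that pattern, assuming boto3 and hypothetical bucket and key names, a read-through cache in front of S3 might look like this: S3 stays the durable source of truth while repeated reads are served from memory.

```python
import time
import boto3

s3 = boto3.client("s3")
_cache: dict[str, tuple[float, bytes]] = {}  # key -> (fetched_at, body)
TTL_SECONDS = 30.0

def cached_get(bucket: str, key: str) -> bytes:
    """Read-through cache: serve hot objects from memory, fall back to S3."""
    now = time.monotonic()
    entry = _cache.get(f"{bucket}/{key}")
    if entry is not None and now - entry[0] < TTL_SECONDS:
        return entry[1]  # cache hit: local-memory latency instead of a round trip
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    _cache[f"{bucket}/{key}"] = (now, body)  # the durable copy stays in S3
    return body

# Hypothetical usage: the first call pays the S3 round trip, repeats are local.
# data = cached_get("example-bucket", "features/user-123.json")
```

Production systems replace the in-process dict with a shared tier (memory or NVMe) and add invalidation, but the division of labor is the same: the cache owns latency, S3 owns durability and throughput.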
[00:55:02] Tobias Macey:
And as you continue to build and iterate on and expand the capabilities of S3, what are some of the features or problem areas that you have planned for the near to medium term, or areas that you're excited to explore?
[00:55:16] Mai-Lan Tomsen Bukovec:
I think what we're doing with Iceberg is super interesting, for a couple of reasons. One is that so many of the data lake customers we have on S3 have really moved to Iceberg and Parquet as a standard for how they interact with structured data. So we are investing pretty hard in our Iceberg support. We're building lots of different capabilities there, so you're just going to have to stay tuned, because we're trying to launch many of them across the course of this year and then keep going from there. I also think you should keep an eye on the metadata capabilities in S3. That's another area where we're investing very deeply. And if you think about it, Tobias, we could have built metadata with its own API, but we didn't. We built it on top of S3 Tables, because we know that as we expand the capabilities of metadata, what won't change is that customers are still going to want to use an Iceberg client to query it. So this idea of bringing SQL to your structured data in S3, and expanding this whole world of metadata and making it accessible to a SQL query, I'm super excited about that. I'm also really excited to see the new data types that are emerging in generative AI, whether it's a checkpoint for a frontier model developer or how people are doing vector search on top of S3.
I think that is really exciting. But at the end of the day, one thing will always be true about S3: we will keep working on these new capabilities, but we have such a commitment to the fundamentals of durability, security, availability, throughput, and all of that. That is what we call an invariant. We hold it to be true no matter what we do. So whatever we do in the future of S3, that commitment to the S3 traits is always going to be a fact. And I say that on behalf of every engineer who works on S3. We wake up every day knowing how deep that commitment is to the byte that you put into the storage service.
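To ground the "query your metadata with an Iceberg client" idea, here is a hedged sketch using PyIceberg against an Iceberg REST catalog. The endpoint, warehouse, table name, and column are hypothetical placeholders rather than documented S3 Tables values; the point is only that a metadata table can be scanned like any other Iceberg table.

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

# Placeholder catalog configuration -- substitute your real REST endpoint
# and warehouse identifier.
catalog = load_catalog(
    "s3tables",
    **{
        "type": "rest",
        "uri": "https://example-iceberg-rest.endpoint",  # hypothetical
        "warehouse": "example-table-bucket",             # hypothetical
    },
)

# Load the metadata table and scan it like any other Iceberg table.
table = catalog.load_table("example_metadata.journal")   # hypothetical name
created = table.scan(
    row_filter=EqualTo("record_type", "CREATE"),         # hypothetical column
).to_arrow()
print(created.num_rows)
```

The same scan could just as easily come from a SQL engine with Iceberg support; the standard is what lets the tooling choice stay open.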
[00:57:30] Tobias Macey:
Are there any other aspects of your work on S3, its applications as a shared data substrate for data-intensive workloads, or any of the other work that you're doing in that ecosystem that we didn't discuss yet that you would like to cover before we close out the show?
[00:57:45] Mai-Lan Tomsen Bukovec:
No, I think we covered a lot, Tobias. I will say that the story of data has always been written by the customers of S3 who started to use it in different ways, and we have evolved it to always be the best storage for whatever they're doing. And that story, I've got to say, Tobias, is just starting. We've been doing this for over eighteen years now, but this story of data is still being written. It's one of the reasons I listen to your podcast: you have so many guests who are doing so many interesting things with data and helping companies and developers unlock that data.
And so, yeah, we're just super excited to be the substrate of data for the story that is being written today by the data practitioners and the application developers and the people who are listening to this podcast.
[00:58:42] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:58:59] Mai-Lan Tomsen Bukovec:
I do have to say, Tobias, there are so many tools out there. And I think that is one of the reasons why I really like where Iceberg and the community are going: Iceberg gives you this nice capability of being able to choose whatever tool you want to use while still having some consistency. I think choice is great, but Iceberg as a standard way to interact with your data makes that choice even more powerful, because you can make a tooling choice and still have a standard that you use. And I think that is why so many of our customers have moved to Iceberg, and it is why we have built this native Iceberg capability directly into S3.
[00:59:49] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experiences on S3, the role that it plays in the data ecosystem, and all of the work that you and your team are putting into making it such a reliable and substantial portion of the data substrate. I appreciate all the time and energy you're putting into that, and I hope you enjoy the rest of your day.
[01:00:12] Mai-Lan Tomsen Bukovec:
You too, Tobias. Great to talk to you.
[01:00:21] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. DataFold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafolds today for the details.
This is a pharmaceutical ad for SOTA data quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of undiagnosed data quality syndrome, also known as UDQS. Ask your data team about soda. With soda metrics observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks and the fastest in the category, analyzing 1,100,000,000 rows in just sixty four seconds.
And with collaborative data contracts, engineers and business can finally agree on what done looks like so you can stop fighting over column names and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing soda may include increased trust in your metrics, reduced late night Slack emergencies, spontaneous high fives across departments, fewer meetings and less back and forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a 1,000 plus dollar custom mechanical keyboard.
Visit dataengineeringpodcast.com/soda to sign up and follow soda's launch week, which starts on June 9. Your host is Tobias Macey, and today I'm interviewing Milon Thompson Bucabeck about the evolutions of s three and how it has transformed data architecture. Architecture. So, Mylan, for, anybody who isn't familiar with your work, can you start by introducing yourself?
[00:02:15] Mai-Lan Tomsen Bukovec:
Yeah. Absolutely. My name is Mylan Thompson Bukovec, and I am a vice president of technology here in AWS. I run AWS services that are basically the data stack. So everything from the bottom of the stack, the bottom turtle, if you will, which is, you know, Amazon s three and our file systems through to the streaming and messaging services like Kinesis, Manage Kafka, SQS, SNS, and then the analytics capabilities of Asena, Redshift, EMR, and basically the whole top to bottom of the data stack. And I am delighted to be here. Tobias, I've been listening to your podcast for many years, and I'm, I'm excited to talk about how s three fits into data systems.
[00:02:59] Tobias Macey:
Absolutely. And do you remember how you first got started working in data?
[00:03:03] Mai-Lan Tomsen Bukovec:
Well, to be honest, I I came into data management and data storage by by way of s three. And, you know, I've been working on it on AWS since 02/2010, but I was mostly a compute person. I used to run the web server engineering for IS at Microsoft. And, you know, the the way I came to AWS is that, I used to spend a lot of time working on making PHP run on the Microsoft web server. And so I had to run all these PHP open source applications like Drupal and WordPress and Windows Server, And I have to go and and give demos and, show customers and show folks at conferences how to do this. And so, you know, this is back in the day. Okay? Tobias and I I I had to ship servers to do my demos.
And so way back in 02/2008, late '2 thousand '8, I started to use e c two. And I did it because I didn't wanna ship hardware to conferences and to customer sites, and it just kinda made me realize that this is gonna be the wave of the future. And so, you know, I I reached out to some folks that I knew in AWS, in 02/2010, and I and I joined, you know, again, on the compute side. I was actually hired by that, same guy that I work for today. And I I joined to run a set of services like e c two auto scaling and simple workflow, and that gave me a crash course in distributed systems. As you might imagine, I was coming from client server architectures.
And it was in 02/2013. It was early in 2013 that I joined Esri as a general manager, and that meant that I was running the s three service from all the development and operations. We operate in a DevOps model to product. And that is what brought me to data systems, data management and data systems because it was pretty clear in 2013 that everybody was using s three not just for backup, not just for storage of website assets, but they were using it for Hadoop based systems and analytics and and things like that. Now, luckily, we have tons of amazing customers like Netflix who, have data leaders like Eva Zeh, who sat me down in 02/2013, and they gave me a crash course in what does it mean to be the storage substrate for, you know, essentially everything. And, you know, we weren't using the terms data lake in 02/2013, but, you know, that that's what that's what leaders like, I would say, that Netflix were were actually doing. They were using Esri as what they call the source of truth for all of their systems. And whether that was Hadoop based systems or it was streaming or it was what what have you, that is, how s three has been used for years and years, and that is how I learned about data.
[00:05:49] Tobias Macey:
Yeah. S three is definitely, as you said, one of those core services that the Internet, as we know it today, probably wouldn't be able to operate without, at least not as efficiently or as effectively. And because of the fact that it is such a generalized system with such a, I don't wanna say simplistic, but a a very simple, from a conceptual perspective, set of capabilities, which makes it applicable to a broad range of use cases. Many of those use cases do exist within the data ecosystem. And with a focus on that, I'm wondering if you can just give a bit of a summary about some of the role that you see it playing in the kind of data management, data engineering use case, maybe leaving to the side some of the more application oriented roles that it plays.
[00:06:40] Mai-Lan Tomsen Bukovec:
Yeah. The story of s three is is, I I think, a fascinating story, and it's really one that is told by customers who have used s three, you know, basically to write their own story of data. Okay? And so, like, if I think about when s three launched, you know, launched over eighteen years ago, like, our whole goal was to focus on how can we give elastic storage? How can we use how can we provide a storage solution that has all the traits that you want for durability and security and, you know, and and basically a a a an economic model, a cost point, where if you have rapidly growing datasets, you can store them and use them without really thinking about scale. Okay?
And, and what happened is application developers, as you say, kinda jumped on this pretty quickly because they started to say, okay. How can I use this capability as providing us a shared dataset that is operated on by different compute applications? And so applications application owners really were the ones that, you know, adopted S3 super rapidly in those first few years after we launched in 02/2006. But what started off as being a place to store images and web page assets and application storage has has really expanded. And if you if you take a look at that, you can see that, like, the journey of storage is really evolved into storage for any type of data, and that could be operational data that's stored in parquet format. It can be PDFs if you're using it for RAG. It could be log files that your application generates.
It can be really anything that you think of. And so with the growth of the storage, so too has grown, you know, the data processes around that storage. And if you think about that, ten years ago, we had just under a hundred s three customers that were storing more than a petabyte of data. But now we have, like, thousands of customers that are storing that much data. In fact, if you look at the storage in s three, we have, like, we have over 400,000,000,000,000 objects. We have exabytes of data, and we're averaging a 50,000,000 requests per second. And, you know, if you think about that scale, if you think about, for example, that we process over a quadrillion requests every single year, the volume of that data means that it is essentially being used as multipurpose storage. Okay? And when it's being used as multipurpose storage, we have to think about a few things when we're thinking about, like, how do we build storage for basically all of those different use cases, all of those different, you know, types of personas and customers, whether it's a data engineer, a data analyst, you know, an application developer.
We have to think about, you know, how do we obsess not just about, like, the details of security and scale and reliability, but how do we remove the distracting things that stem between developers and data scientists and and creative professionals and their data? And it all just kinda comes down to us to, you know, how do you build the best storage in the planet on the planet for any application developer or data engineer, whether you're an expert in it or you're not. You simply wanna use it. And this is super important because, you know, if you look at the growth of data, you know, IDC, they run these these studies. And and one of them, which is what they call the global data sphere, says that the amount of data generated is gonna grow at over 27% per year. If you think about that, Tobias, if you think about, wow, think about all of that growth of the data. Where is it going right now? What is going into cloud storage? And it's being operated on as a shared dataset across many different types of applications.
[00:10:31] Tobias Macey:
To that point of s three being a shared substrate for all data use cases and the fact that customers use it in such novel and unexpected ways regardless of whether it's something that you ever intended to support, that has driven a number of generational shifts in terms of the feature set and capabilities from the early days of I can just put an object in a bucket, and I can get it back. And maybe I don't even have read after write consistency, so I maybe have to pull to make sure that I get it back to where we are now, where we have that read after write consistency and many other capabilities besides given the focus on data use cases from an analytical or ML and AI perspective.
What do you see as some of those major generational epochs in terms of the capabilities that s three has provided and how that has both enabled as well as been driven by some of those different stresses on the ways that customers are applying that fundamental technology?
[00:11:33] Mai-Lan Tomsen Bukovec:
Yeah. Tobias, I gotta say, if you're gonna use the word generation epochs, you're gonna make me feel super old. Okay? So let's just call it the story of s three. And if you go to the story of s three, you gotta go all the way back to 02/2006. And 2006 is a super interesting year for multiple reasons. It's a year that both s three launched, but it was also the year that Hadoop started as an open source implementation. And MapReduce kinda came into the world. Right? And both of these shifts are super influential because it's the rise of this, what I think of as the modern data pattern, which is where you aggregate vast amounts of your data, and it gets operated on a shared storage.
And what that means is that you're running different clusters of compute. You could be running compute for analytics. You could be running compute for applications. You could be running compute for, you know, inference or pretraining, but you're doing it in parallel against a shared storage set. Okay? And so, you know, Amazon s three made that all possible in 2006 because we brought in multimodal data. We brought in text and image and backup and video at a scale with the kind of cost, point, the low cost economics that I talked about, but also what we think about is Esri traits. Right? And the Esri traits of durability and security and availability and all of that. Okay? So if you're going back into 02/2006, if you think about, you know, the first, I would say, six years of Esri, between 2006 and 02/2012, you had frontier thinking companies like like Netflix and Pinterest and Lyft and, you know, a lot of the startups in that in that time frame, they were the ones that jumped on this super fast. They were really, really quick to see that if you could combine, you know, those traits of s three and you could combine open source, you combine Hadoop and and MapReduce and these new patterns that were starting to emerge on these shared datasets.
You could build a new data pattern and you could do things at scale you could never do before. And, Tobias, I'm gonna send you a link, if you wanna post it for your readers, where Netflix talked about how and this is way back in January 2013, how they used S3 as their source of truth for all of their data processes in the company. And this was way back in January 2013, which means that they had adopted it for in in the years beforehand. And they were some of the first that started, you know, in in, that started to adopt this aggregation modern data pattern. And, you know, that was not just unique to Netflix. You had other companies like Pinterest, which was founded in 02/2010, and they were doing the same thing. And so, you know, I I have been working on s three for, you know, over twelve years now. And I I remember I actually remember when we were, you know, looking at the metrics of usage and we're looking at the metrics of IO, and we started to see these super highly concurrent requests coming into a shared dataset. And it was pretty different from the traffic patterns of website assets and and backups. And so as you mentioned, we had to make a fair number of changes under the hood. Okay? And, yeah, you talked a little bit about consistency. S three was launched as an eventually consistent system.
And we we moved to a strongly consistent system because of what people were doing rapidly in those first few years, of s three and and moving to using MapReduce and and and doing analytics and and doing those concurrent requests. And, you know, it's not just the consistency model that we had to change. We had to, we had to do automatic key partitioning deep within our index system. We had to, introduce new storage capabilities like intelligent tiering that had dynamic pricing built in. And we had to do this not just because these born in the cloud companies like Pinterest and and how Netflix reinvented itself. We started to see in about 02/2013, the first, you know, wave of enterprise customers really start to do the same thing. And it was JPL NASA, which was one of the first customers that started to adopt the same pattern. It was FINRA, the regulatory body of the stock exchanges.
And so, you know, around 2000 and, I would say, 12/2016, we started to see this wave of enterprises start to adopt the same patterns that Netflix and Pinterest and Lyft had started to do. And so, you know, if you think about this time between 2006 and 02/2019, that was really when we started to see this pattern of data aggregation on shared datasets become super common. And, you know, people call that the data lake. Right now, we have over a million of these data lakes running on s three, but the adoption of the the data aggregation pattern was really customer driven. And it was customer driven because it was a combination of the rise of open source analytics with the combination of the growth of storage, and customers just started to do that. And so, like, in the aggregation model, you have producers of data, and they're generating the data that's stored in s three, and it's applications like, everything from, you know, applications that are generating log files to sensors that are sending in data to ETL workflows, and then you have consumers.
And the consumers are operating on that shared aggregated dataset, like the scientists or the app builders or or the data engineers. And these two concepts of production and consumption are are basically they're disaggregated much like how the shared dataset model disaggregates compute and storage. And that is the pattern that has really grown up over the last set of years, and it is the one that is being adopted by scale. And it's being adopted by customers in every single industry, and it has been the pattern that's driven everything from us adding object versioning in 2008 to, you know, cross region storage replication in 02/2015, you know, to what we talked about, what you brought up, which is the introduction of strong consistency in s three.
[00:17:47] Tobias Macey:
And most recently, s three tables for people who are investing in Iceberg as their core relational format.
[00:17:55] Mai-Lan Tomsen Bukovec:
Yeah. I mean, you know, if if you think about the evolution here, Tobias, and, you know, Iceberg is just fascinating. I mean, if you think about Iceberg, Iceberg was a project that was started by developers from Netflix and Apple in 02/2017, and these are developers that grew up on s three. They grew up on s three being the source of truth. And so when they looked at how they were using S3, they were just like, Okay, how do we kind of reimagine the whole Hive concept? Right? And, you know, they they contributed Iceberg as an OTF into Apache in, I think it was 2018.
And by 2020, it was a first level project. And if you think about, you know, you connect the dots there, Tobias, with the growth of, for example, Parquet data. So F Suite today stores exabytes of parquet data. And as you know, parquet is a very compressible format. So exabytes is a lot of parquet data. And today, we have over 15 we have an average of over 15,000,000 requests per second just to parquet data type. It is actually one of my fastest growing data types in s three. And so if you think about iceberg as something that has really emerged as a very, very popular way of interacting with structured data in s three, Yeah. We looked at that and we said, okay. How can we make that better? And, yeah, it was in February, we launched s three tables, which is basically a new bucket type, but it's a built in s three native support for Iceberg table access to your parquet data that you wanna store in your bucket. And so, you know, s three tables makes us the first object store with this built in OTF support for Iceberg, and it's super, super popular. You know, we're making tons of changes to it. We just added support for KMS.
We now support the Iceberg REST endpoint. We have integration for s three tables with AWS analytics services, but we also have integration with Snowflake and DuckDV. And we think this is gonna be, you know, a major step forward for a lot of folks who wanna interact with their per kite data at NetSuite.
[00:20:03] Tobias Macey:
And going back to your point of this being a shared pool of data, single source of truth, something that can easily be aggregated across different use cases, there have been shared data solutions for decades far predating the cloud, thinking in terms of things like NFS and Samba from the protocol layer and then implementations around things like Gluster and Ceph that allow you to have this shared pool of data. And I'm wondering what you see as the fundamental differences between those protocols and technical implementations that allow for that sharing and also shared network drives versus what s three and all of its copycats have enabled in terms of being able to build these massively scalable, massive throughput data systems?
[00:20:52] Mai-Lan Tomsen Bukovec:
Well, you know, I gotta say, Tobias, I I really do think that there's nothing quite like s three. Right? And, you know, I mean, I have worked on it for a while now. And I I know there's there's other approaches on premises, and there's plenty of applications out there that that talk about having s three compatible APIs. But it's kinda it's what's under the hood that that matters. And, you know, if you just take some of those storage protocols that you mentioned, they had to kind of decide they were done at some point. And they've been very slow to move and evolve. And some of the things that we just talked about, including up to s three tables, it kind of you know, it it points to the rate of evolution and the decree of change that s three has done over the last eighteen years. And, you know, I mean, don't get me wrong. Like, in s three, the team is totally fixated on not breaking existing customer workloads, but we've also been able to to quickly learn and adapt and improve at just about every single layer of s three.
And, you know, I I gave you some, scale numbers, you know, the 400,000,000,000,000 objects, exabytes of data. But we also, you know, daily send over 200,000,000,000 event notifications. Okay? And in total at peak, we're serving over a petabyte per second of traffic worldwide through our front end. One petabyte per second, and we are doing over 10,000,000,000 checksums of computation per second. And so if you think about scale, it's it's kinda like what's under the hood of s three that makes it so it it it just makes it so unique. So I've mentioned, in our conversation is that s three customers write the story of data with s three. And I'll just I'll take one example. Okay? So FINRA is a regulatory stock body of The US Stock Exchanges, and they're a big customer of AWS and s three. And so what FINRA does is they regulate member broker dealers in the security markets, and they're responsible for collecting, processing, and and analyzing every single transaction across all of our US based equity and option markets to look for improper behavior, to start proceedings against examples of fraud within twenty four, hours of it happening. And so if you think about that whole mission, think about it from a data perspective, they have to ingest hundreds of billions of records every single day. They have had peak days where they have processed over 906,000,000,000 records, multiple days in a row. And on average, if you think about that from a data perspective, it kinda comes down to 30 terabytes of data they have to process in a single day. And they have this super strict four hour SLA for processing their data. And their workload is very bursting because they're accessing their data very, very hard for a short period of time, this four hour period I talked about. And then they're idle for another twenty hours a day. Not purely idle. They're just they're ingesting new records, but they're doing all of their bulk processing within a four hour window.
So they have twenty hour windows of relatively light activity. Okay. So if you take, you know, a data workload like that and you think about a storage system that has to support it, what is under the hood? That's what matters. And s three is very, very unique because of the way that we approach scale. Alright. So let's kinda talk about a few numbers. And, Tobias, you are going to have to pull me out of the weeds if you think I'm I'm I'm in there too much. Okay?
[00:24:14] Tobias Macey:
Wouldn't dream of it. Okay.
[00:24:16] Mai-Lan Tomsen Bukovec:
Okay. Thanks, Tobias. Alright. So let's just take, you know, a customer. It can be Finra or whatever. But let's just take a customer that wants to hold a petabyte of data, and they wanna process all that data in one hour period. K? Bursty customer. So if I were to access a petabyte over a one hour period, the access rate, the intended access rate is going to be two seventy five gigabytes per second across that data. Right? And if it's about a megabyte per object, that's about a billion objects. So if you think about that, take a step back, it's pretty good workload size. And if you go back to the numbers and tie it together, if you say I have a petabyte of data that I wanna access, a 275 gigabytes per second of peak, if I were just storing that data and not accessing, I would need, under the hood, 50 drives if there were about 20 terabytes per drive, which is roughly the drive size that we're landing nowadays in s three. But if I were gonna access 275 gigabytes per second and 50 megabytes per drive, I would need over 5,000 drives in my system to just support that one workload. So there's a hundred x amplification.
If you're thinking about a drive count, if you think about the access rates over the storage rates, and those drives end up being idle, you know, twenty three hours of a day in this particular example, because I'm only using the capacity, that IO capacity for an hour a day. So if you take a step back and you think about, okay, you know, what does that mean? That means there's a huge amount of inefficiency if you were trying to do that yourself. You're paying for a hundred x the capacity, but you're not using it for 95 plus, percent of the time. So what would this workload look like? And if s three has millions of active customers, isn't this a problem that would be, like, massively worse? A million times worse for s three as a service? But this is where it comes down to what is under the hood of s three. The really cool part of what you get and is very differentiated is how s three works at scale. Again, there is no compression algorithm for experience, and s three has been doing this for over eighteen years.
And what we do is we have certain certain patterns that we think about. Okay? And when individual workloads are really bursty, but we make sure that independent workloads have nothing to do with each other, you can actually manage this at scale. Okay? So, yeah, younger data tends to be hot. Smaller objects tend to be hot. We have lots of young and small objects. We have over 400,000,000,000,000 objects. But you know what? We also have a lot of older and very, very large objects. And when you layer those two things together, those super hot, super small objects for, let's say, analytics, suddenly, the system, you know, layered with with these really large cold objects becomes a lot more predictable. And as we aggregate them, even though our peak demand progressively grows in s three, the peak to mean is collapsing, and the workloads become more predictable. And that is really awesome if you think about the physics of how hard drives work, which we do. We're actually fairly obsessed about them, because we can spread millions, thousands and millions of customers across many more drives than if they ran their storage alone. And we can do that because we can overlap the peaks of some workloads against the quiet periods of others.
And that is really what differentiates s three from everything else out there. It's how we make scale to work to our advantage. We have tens of thousands of customers, probably some of who are listening in today, who have data that's spread out across a million drives each even if their storage footprint does not require that much, which means if you're able to spread across a vast number of spindles of these these spinning hard drives, you're able to burst to throughputs offered in aggregate by those spindles even if the space that you're taking up for storage doesn't really fill up a drive. And because any single customer is a very small portion of a given disk in s three, we have this, like, really nice workload isolations where we decorrelate different customers across their peaks and valleys. Okay? So, I mean, like, honestly, Tobias, I could go on and on. Sorry to geek out on this. It's kind of the essence of Estree and why it's so unique. But, you know, this is the team that geeks out on every last detail of that massive scale right down to how we, I'm not kidding you, wheel in a new rack to a data center. Okay? So some of the rack types we use, they weigh upwards of 4,000, five thousand pounds. That's over two tons. And so if you think about a rack and there are thousands of pounds, they weigh more than a car.
They are so big that when we have to consider, you know, a data center and the design of our data centers for AWS, we consider the weight of the S3 rack because when we have to land them in a loading dock, they have to be able to be wheeled into their final location without literally collapsing the floor along the way. And so in our AWS data centers, we have reinforced flooring to support the weight of moving our s three data racks around. And when you think about where you put your storage and you think about all the engineering right down to the floor of the AWS data center, that's what you get when you put your byte in s three.
[00:29:44] Tobias Macey:
And in terms of the architectural patterns that are enabled because of the massive throughput, the reliability, the other features that are baked into s three as a core primitive, and also the fact of being able to use it as a single source of truth across boundaries of application data, data. So log files, as you mentioned, user uploads of different images, various other unstructured object storage, as well as use cases for in yeah. Analytical data going back to the iceberg use case and the increasing growth of ML and AI use cases on top of that. What are some of the ways that that shared pool of data and the unified access pattern that it offers change the ways that teams have approached the architectural primitives of the systems that they're building given that the fact that they do have s three as that core backing store without having to worry about so much of the, for better or worse, having to worry so much about some of the data patterns that go into it?
[00:30:50] Mai-Lan Tomsen Bukovec:
Well, one of the reasons why I think that data aggregation took off is because when you have an aggregated set of storage, you have a federated data ownership model, and a lot of companies really like them. Right? And, you know, I mean, you can talk about these these companies that were born in the cloud like Pinterest. You know, they yeah. You know, David Shakin, who's one of their engineering leaders, says that they they swim in data. It's like, you know, they they it's like the elements. It's water. It's air to them. And when you have a capability to swim in data, the consumers of that data have a lot of control of how they want to use the data. Okay? And because so much integrates with s three, you can have the consumers of the of the data decide what they wanna use. They can use an AWS service to process the data through, for example, you know, our managed Kafka service. They can use, DuckDB. They can use, you know, basically, this whole rich ecosystem of tooling to interact with the data and and do something with it. And I if I think about, you know, the the patterns that I've seen emerge, I mean, you know, I I think Parquet has has definitely emerged as as a as a default for a lot of different companies for how they wanna interact with their structured data. It's why we enter introduced, s three tables. But I would say, Tobias, that one of the things that I find super interesting is in the last five years, you know, metadata has has really started to take a central role in how either data practitioners or application developers interact with their data. And, I mean, this has been true for a long time. If you go back to to Netflix, Netflix, has has always talked about how critical metadata and metadata systems, have been for their data practitioners.
It is actually one of the reasons why we launched S3 metadata in an S3 table so you can navigate your metadata through an iceberg compatible client. But, you know, I think, Tobias said, metadata is gonna be the data lake of the future. You know, as customers add more and more metadata around data lineage and, data processing and data governance and all the things that, you know, our data practitioner listeners know is super critical about not just finding your data, but also being able to use it and interact with it over time. I think that's where you're gonna see these systems really evolve. And, you know, for us, putting metadata at the s three storage layer as a native construct is is really important because we know that customers are gonna wanna have those same traits of storage. They're gonna they're gonna wanna have durability. They're gonna wanna have, you know, availability. They're gonna wanna have s three kinda openness to interacting with the data while still having the security and the ability to to protect it and use it. And so, you know, if you think about that and you think about let's just take, for example, you know, in late twenty twenty two, you had the rise of these generative AI models come out. And if you think about some of the fastest companies to put into production their AI based solutions, a lot of those companies were using S3 as a backing store. And they they had a very good understanding of what data they were going to use for it. And I'll I'll give you an example. In 2023, Photoshop introduced generative fill, and they introduced it, pretty quickly based on the capabilities in Fireflies. And in Adobe Fireflies, they, you know, create their own models that are are powering the generative, fill application, in Photoshop. And they're doing that because they're they're training on all the stock images, etcetera, that are stored in s three. But they were able to get that capability out really quickly because they were already operating with their data systems in s three. They were already operating with a deep understanding of metadata that they were managing for understanding, you know, how they were doing their training.
And, they were able to get generative fill out there, which is super popular, you know, fastest adopted feature in Photoshop history. And they were able to do it in the first part of twenty twenty three. And, you know, I mean, that's just one example of of companies that were out there with production experiences. Booking.com is another one. You know, Intuit launched AI assistant in their properties like TurboTax so that in the first, you know, in the first quarter of twenty twenty four, customers were using TurboTax, AI assistant, and they were using it with datasets that were stored in s three. And those datasets ranged from, you know, obviously, the history of your tax returns, but it was, you know, an understanding of the changes of the tax model, which was a dataset that AI assistant used and, the decades of understanding that Intuit has built up over best practices.
Those three different datasets were going into AI assisted tax filing for customers in the first quarter of of twenty twenty four. And so, you know, the shared aggregation model combined with, you know, an understanding of metadata has really been behind the scenes driving a lot of what customers are doing with the latest, you know, capabilities of, for example, generative AI models.
[00:36:01] Tobias Macey:
To that point of metadata being such a critical element of understanding what are your data assets, how are they being used, how can they be used. There are also many anti patterns that can grow up through the use of s three because it doesn't enforce any structure in and of itself. And I'm curious how you've seen companies hamstring themselves as a result of being a little bit more careless in terms of how they employ s three, the various patterns and best practices that they either fail to establish or establish in a manner that is blocking them from being able to unlock some of those more advanced ML and AI use cases and some of the ways that teams should be thinking about some of those core design principles of how to structure the storage and access controls and access patterns of the particularly application generated data assets that they are managing in order to be able to then consume them for additional use cases beyond just the ways that they were primarily envisioned?
[00:37:05] Mai-Lan Tomsen Bukovec:
It's a great question. You know, if I think about and we talked a little bit about just the growth of overall data, Tobias, a little bit earlier. You know, that we look at some of the patterns that are out there generally. And, you know, if you look at IDC did one survey on this one, and they found that enterprise organizations, they're generating nearly 60% of the world's data, and that's expected to grow by, to 70% actually by 2026. And it's driven by everything from the influx of IOT sensor data to the install base of video cameras to AI ML to applications.
There's just like, you know, I mean, there's so much data and it's coming from all of these different sources. And as you say, there's some pretty basic stuff that you have to do from a security perspective. And, you know, S3 is secure by default. When you set up a bucket, the only person who can access it is the person who set up the bucket. But, you know, we found pretty quickly that one of the things that was important is to create this capability. We call it block public access, and we launched it many years ago, where it's a control. It's not a setting. It's a control. And the control basically that you can set at a bucket level, an account level, an organizational level is to say, for anything that you add to this bucket now and in the future, you never have public access to it. Super important for enterprises. And what we've done for S3 is that we've actually made it a default now for buckets. Right? And so we will always block your public access until you go in and do, you know like, you can change it. We don't recommend you change it, but you can change it if you absolutely must change it. It's just actually pretty hard because it's a control that we put in place. And, you know, if if you think about that growth of data and you think about a lot of it's happening in the enterprises, the enterprise really cares about a data perimeter. That's what they call it. How can I enforce my data perimeter? Controls like block public access are are are part of that. But we have many other things in AWS that that help you with that. We have this this capability called AWS access analyzer. What access analyzer does is, use automated reasoning, which is sort of like, you know, math, got married to computer science and they had kids, they would be automated reasoning kids. And automated reasoning as as a science powers this whole access analyzer, which will go analyze all the rules that you have in place on s three or or EC two resources.
And I'll tell you where the rule isn't actually doing what you think it is. It it tests the correctness of the rules. And so that concept is starting first with security and having the capabilities like access analyzer to help you with it is is actually one of the reasons why so many enterprises actually do use s three is because you can establish a data perimeter. You can put these controls in place. Now that said, what customers often do is they add this layer. We talked about metadata and how metadata is really, you know, taking off really around as a as a as a data pattern of not just discovery, but understanding the usage of data. And a lot of customers like BMW Group and, you know, Cox Automotive, they're actually building on top of the aggregation pattern and they're creating something that I call curation. It's basically data products. Okay? And in the concept of this data product, they're taking subsets of data in the aggregated model and they're cleaning the data. They're processing their data. They're doing all this stuff. And then they're providing access to those datasets. Again, this is just data in s three. So it's, you know, at the end of the day, it's data in a bucket. Right? Or it's an s three access point or what have you. But they're offering those clean datasets as, quote, unquote, the s three data to the marketing team, to the application team, to, you know, people who wanna use it. Now you're always gonna have organizations in your business who need access to everything in s three. Right? Like your fraud team or your scientists or your researchers. Right? Because they could do all kinds of things if they have access to structured and unstructured data, and they're super inventive and they can come up with stuff. But the pattern that we've really been seeing in the last two to three years is that on top of this aggregated model where you can provide access to a subset of your users, right, who need the raw data set, people are curating the data. And they're providing it through datamarts or you know, sometimes, you know, internal portals. And those subset of data products are then the ones that are driving the business. And they're the ones that the majority of the compliance and the governance is acting on because those are the datasets that are approved for, you know, enterprise use. And that that has been super interesting. That is, again, one of the reasons why we built s three metadata and we're iterating super fast on it because we want people to be able to run an iceberg query on their metadata and really understand, you know, how to find objects, how to, you know, you can you can put user metadata into an s three metadata table. And, that trend, I think, is really gonna pick up because those same datasets are being used for analytics workloads as they're being used for AI workloads, and customers are going to wanna base their businesses on them.
[00:42:17] Tobias Macey:
And on the point of AI workloads, the fact that there are so many unstructured data assets that live in s three that have been accrued over multiple years starting from the initial big data era of just store everything, and maybe it'll be useful sometime. And I think that that time is becoming now, and I'm wondering how that has added new stressors to s three as a service as well as some of the ways that it has unlocked new architectural patterns for the teams that are building some of those solutions on top of the data that lives in s three.
[00:42:52] Mai-Lan Tomsen Bukovec:
If I think about the last few years, Tobias, I mean, it is kinda interesting. I mean, the if I think about, you know, Adobe trains their own models. Right? You know, we work pretty closely with Anthropic. Yeah. We have our own models, Nova models in that Amazon produces as well. And, you know, the the source of the pretraining data is often s three. Right? But the inference driven experience is also driven off of datasets in s three. And, you know, like, a a recent example of that is, LexisNexis just recently launched a capability they called Protege. And Protege is an AI driven, capability, which, uses AWS cloud services including s three. And the goal for Protege was to create this AI driven assisted experience that can change how legal teams work. Okay? And that's corporate legal departments, it's law firms, it's law schools. And the generative part of that is using AI to draft transactional, you know, documents. Right? Litigation motions, briefs, complaints.
You can upload from a dataset perspective, for example, an expert witness deposition, and you can create an AI summary. And a lot of the interesting things about the capabilities of Protege is that it's not just using AI to speed up these legal tasks. It's it's actually customizing the AI experience based on the personal style of the user that's using it. Okay? And so if you think if you bring that back to a data perspective, when you're using a capability like Protege from LexisNexis, it's taking into account the user style. It's taking into account the firm style. It's taking into account the firm's work requirements so that all of that custom data is actually gonna customize the AI assisted output.
And, you know, if you think about it under the hood, these are all agentic systems. Right? And the agentic system that's under the hood is, is basically doing everything from taking those datasets and it's personalizing the output, but it's also doing things like, their AI system is, for example, have a planner agent, right, in their AI system. And the planner agent is basically taking a legal question and it's breaking it down into several steps that can then be processed by other agents. Or it's having an interactive agent that lets the user modify the agent's plan and choose the best course of of action, or it has a self, reflection agent. And the self reflection agent is self evaluating, and it's refining the work of what it's doing to make sure it's drafting a better document.
And so if you if you think about sort of the patterns of the future, we have a lot of humans working with data right now in s three. And so how those humans work with data is they work with metadata and they work with the underlying data and they do, you know, iceberg queries against s three data or they, you know, do a semantic search across, you know, unstructured data. But the evolution of where the data is going with s three is that more and more agents are gonna process the data. It's not just gonna be humans. You know, the humans are always gonna be in the loop. But in these systems like Protege, you have agents that are processing the data, and you have agents that are reflecting upon the quality of the response and how they're using the data, and that is absolutely where companies are going in the future. And whether it's a human interacting with the data or it's as, you know, it's it's an agent interacting with the data, they're all gonna wanna interact with these datasets to either customize the inference or, you know, do the data hygiene, the data processing of those, data products that we talked about. That is a super exciting evolution, and I am, always fascinated and inspired by what customers are doing with this vast amount of data that we have in s three.
[00:46:48] Tobias Macey:
And in terms of the most interesting, innovative, or unexpected capabilities and product used, I'm wondering what are some of the ones that stand out to you?
[00:47:00] Mai-Lan Tomsen Bukovec:
Well, I don't know, Tobias. I am a mom, and I have three kids, and I don't have favorites, Tobias. Okay? So I have, like, so many different customers doing so many different things. I think, you know, I've I've always been super impressed with what Netflix has done with data. When they pivoted to their streaming model, they they did it in a way where they just instrument everything, just everything. And it's like what Pinterest talked about where they swim in data. Customers like Netflix and Pinterest, they're so agile and they're so fast because, they've really aggregated all their data in S3 and they're able to use it in super interesting ways. In fact, if you think about Pinterest, very, very small percentage of Pinterest data is actually the pins.
The rest of it is the operational and analytical data that drives everything in their super-targeted visual search experience. You know, I think about what FINRA has done over the years. I remember the first conversation I had with the FINRA team back in 2013, when we were talking about a bucket and bucket taxonomy. You look at how they've evolved over time, and it's actually super amazing what they've done, really right down to the economics. Between 2019 and 2021, their data grew, I believe, threefold in that time frame, but they dropped their unit cost by 50% just by using our storage classes of Glacier Instant Retrieval and Intelligent Tiering.
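If those figures hold, the implied arithmetic is that total spend grows far more slowly than the data: threefold data growth at half the unit cost is roughly 1.5x total spend rather than 3x. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope math on the FINRA numbers quoted above.
data_growth = 3.0        # data grew roughly threefold, 2019-2021
unit_cost_change = 0.5   # unit cost dropped by about 50% via storage classes

total_spend_multiplier = data_growth * unit_cost_change
print(total_spend_multiplier)  # 1.5 -> 3x the data at half the unit
                               # cost costs 1.5x, not 3x
```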
And so I think that's amazing. I think what Adobe has done, and how they've really transformed the digital professional's experience with generative fill and all their different capabilities recently, I think that's amazing. I was spending time with Min Chen, who's the chief AI officer of LexisNexis, talking to her about how they built Protégé, and I think that is going to change how legal professionals work. And, you know, honestly, I could go on and on, Tobias, but I am not allowed to pick a favorite.
[00:49:06] Tobias Macey:
And in your own experience of working on such a foundational product for so many people, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:49:18] Mai-Lan Tomsen Bukovec:
I mean, there are just so many of them. I've been working on these AWS systems for fifteen years now, and on S3 for over twelve. And, you know, I think for us it really comes back to your mental models and how you think about these really, really large-scale systems. And, Tobias, I'll send you a link, because your listeners might find this interesting. We have these principal engineering tenets in S3 that our engineers use as a guide, insofar as it helps them break ties, or pick how they want to operate, or how they show up in a project. And, you know, I personally think that tenets are the most interesting when they're slightly in conflict. Okay?
And so here are the two that I always think of when I work on S3. One of them says that, as a principal engineer, you have to be technically fearless. In the wording of the technically fearless tenet, we say: we acquire expertise as needed, we pioneer new spaces, and we inspire others as to what's possible. So if you take a step back and think about a system like S3 and how many people depend on it as a system of record, you have to be technically fearless. You have to be able to say that what we have today is not enough for tomorrow. We have to keep reinventing what we are and how we can provide the best storage on the planet.
And there are these capabilities that we added, like S3 Tables and S3 Metadata, and so many other capabilities we've added in the last two or three years. We've raised the number of buckets that you can have; you can now have a million buckets in your account. There are so many things that we've done in S3 over the years because we have to be technically fearless. But there's another principal engineering tenet, and that tenet is called respect what came before. In the words of that tenet, we say: we appreciate the value of working systems and the lessons they embody, and we understand that most problems are not essentially new.
And that tenet is really interesting for S3, because in S3 we have over eighteen years of experience building and operating systems. And we know that customers don't want to re-architect. They want to build, they want to expand and evolve their existing capabilities, and take advantage of the latest without having to change what they're doing. It's why, for example, when we introduced strong consistency, we introduced it at no extra cost and no performance deficit for every single request that you make to S3, because we know that we have to respect what came before. And so if I think about the most interesting and challenging thing that I have learned while working on S3, it's this: how do you take those two engineering tenets, being technically fearless and respecting what came before, and make sure you deliver on both in every change you make to S3, whether it's some of the geeky stuff I talked about under the hood or the new capabilities that we talked about? And how do you do that in a way that adopts the latest science?
And, you know, we talked about strong consistency, and I mentioned automated reasoning. We have built automated reasoning into so many different parts of S3. In fact, when a change is checked in to our index layer, we're using a proof to make sure that it doesn't introduce a regression to our consistency model, and we're doing that with automated reasoning. So if you think about the challenge of being technically fearless while respecting what came before, the challenge for us is to do things like use math to help us. The challenge for us is to make sure that we keep these S3 traits, these principles of durability and availability and security, built into the mental models and really the culture of everything that this team does.
And that's a challenge, and it is a challenge that every single member of the S3 team, including myself, wakes up and works on every day.
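AWS's actual consistency checks use formal automated reasoning and proofs. As a far simpler illustration of the same idea, machine-checking a consistency invariant on every change, here is a property-based test of read-after-write on a toy index, using the Hypothesis library. The ToyIndex class is invented for the example and bears no relation to S3's real index layer.

```python
# Toy illustration of machine-checking a consistency invariant --
# nowhere near AWS's formal proof tooling, but the same spirit:
# the invariant is asserted automatically over generated histories.
from hypothesis import given, strategies as st

class ToyIndex:
    def __init__(self):
        self._store = {}

    def put(self, key: str, value: int) -> None:
        self._store[key] = value

    def get(self, key: str):
        return self._store.get(key)

@given(st.lists(st.tuples(st.text(min_size=1), st.integers()), min_size=1))
def test_read_after_write(ops):
    index = ToyIndex()
    for key, value in ops:
        index.put(key, value)
        # Strong consistency invariant: a read that follows a write
        # must observe that write.
        assert index.get(key) == value

test_read_after_write()  # Hypothesis runs the property over many inputs
```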
[00:53:49] Tobias Macey:
And for people who are building data systems and designing their data architecture, what are the situations where you would advise against using S3, at least as a primary store?
[00:54:03] Mai-Lan Tomsen Bukovec:
Well, you know, we see S3 used for just about everything, so that's kind of a hard question to answer, Tobias. But if you want sub-millisecond latency, it's not ideal as a direct store. I'll tell you, though: with the next generation of these infrastructure startups and SaaS providers of infrastructure, what we're seeing happen right now is that they're building that sub-millisecond layer on top of S3. They're using S3 as a substrate of data for the durability and the throughput, but they're caching the data at a higher layer for that super-low-latency access.
Because if you're using S3 directly, you don't have that sub-millisecond latency. But if you think about all of the new infrastructure startups out there now, a lot of them are using S3 as the substrate below that cache.
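A minimal sketch of the pattern she describes, assuming boto3 and an in-process dict standing in for the low-latency tier (a production system would use something like Redis or local NVMe); the bucket and key names are made up:

```python
# Read-through cache over S3: S3 is the durable substrate, the cache
# serves the sub-millisecond path. Names below are hypothetical.
import boto3

s3 = boto3.client("s3")
cache: dict[str, bytes] = {}

def read(bucket: str, key: str) -> bytes:
    cache_key = f"{bucket}/{key}"
    if cache_key in cache:                         # sub-millisecond path
        return cache[cache_key]
    obj = s3.get_object(Bucket=bucket, Key=key)    # millisecond path to S3
    body = obj["Body"].read()
    cache[cache_key] = body                        # warm the cache
    return body

data = read("example-bucket", "datasets/sample.parquet")
```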
[00:55:02] Tobias Macey:
And as you continue to build and iterate on and expand the capabilities of S3, what are some of the features or problem areas that you have planned for the near to medium term or areas that you're excited to explore?
[00:55:16] Mai-Lan Tomsen Bukovec:
I think what we're doing with Iceberg is super interesting, and it's super interesting for a couple of reasons. One is that so many of the data lake customers we have on S3 have really moved to Iceberg and Parquet as a standard for how they interact with structured data. And so we are investing pretty hard in our Iceberg support. We're building lots of different capabilities there, so you're just going to have to stay tuned, because we're trying to launch many of them over the course of this year and then keep going from there. I think you should also keep an eye on the metadata capabilities in S3. That's another area we're investing in very deeply. And, you know, if you think about it, Tobias, we could have built metadata with its own API, but we didn't. We built it on top of S3 Tables, because we know that as we expand the capabilities of metadata, what won't change is that customers are still going to want to use an Iceberg client to query it. So this idea of bringing SQL to your structured data in S3, and expanding this whole world of metadata and making it accessible to a SQL query, I'm super excited about that. I'm also really excited to see the new data types that are emerging in generative AI, whether it's a checkpoint for a frontier model developer or how people are doing vector search on top of S3.
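As a hedged illustration of "bringing SQL to your structured data in S3", here is how a query against an Iceberg-backed table, such as an S3 Metadata table, might look through Athena with boto3. The database, table, column, and output-location names are all hypothetical; the point is that any Iceberg-compatible engine can run the query without a bespoke metadata API.

```python
# Illustrative only: query an Iceberg table on S3 with plain SQL via
# Athena. All identifiers below are made up for the example.
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="""
        SELECT key, size, last_modified_date
        FROM my_metadata_table
        ORDER BY last_modified_date DESC
        LIMIT 10
    """,
    QueryExecutionContext={"Database": "my_s3_tables_db"},
    ResultConfiguration={"OutputLocation": "s3://example-query-results/"},
)
print(response["QueryExecutionId"])  # poll this ID for results
```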
I think that is really exciting. But at the end of the day, one of the things that will always be true about S3 is that while we keep working on these new capabilities, we have such a commitment to the fundamentals: durability, security, availability, throughput, and all of that. That is something we call an invariant. We hold it to be true no matter what we do. And so whatever we do in the future of S3, that commitment to the S3 traits is always going to be a fact. And I say that on behalf of every engineer who works on S3. We wake up every day knowing how deep that commitment is to the byte that you put into the storage service.
[00:57:30] Tobias Macey:
Are there any other aspects of your work on S3, its applications as a shared data substrate for data-intensive workloads, or any of the other work that you're doing in that ecosystem that we didn't discuss yet that you would like to cover before we close out the show?
[00:57:45] Mai-Lan Tomsen Bukovec:
No, I think we covered a lot, Tobias. I will say that the story of data has always been written by the customers in S3 who keep finding new ways to use it, and we have evolved S3 to always be the best storage for whatever they're doing. And that story, I've got to say, Tobias, is just starting. We've been doing this for over eighteen years now, but this story of data is still being written. It's one of the reasons I listen to your podcast: you have so many guests doing so many interesting things with data and helping companies and developers unlock it.
And so, yeah, we're just super excited to be the substrate of data for the story that is being written today by the data practitioners, the application developers, and the people who are listening to this podcast.
[00:58:42] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:58:59] Mai-Lan Tomsen Bukovec:
I do have to say, Tobias, there are so many tools out there. Right? And I think that is one of the reasons why I really like where Iceberg and its community are going: Iceberg gives you this nice capability of being able to make a choice of whatever tool you want to use, but have some consistency as well. I think choice is great, but Iceberg as a standard way to interact with your data makes that choice even more powerful, because you can make a tooling choice but still have a standard that you use. And I think that is why so many of our customers have moved to Iceberg, and that is why we have built this native Iceberg capability directly into S3.
[00:59:49] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experiences on S3, the role that it plays in the data ecosystem, and all of the work that you and your team are putting into making it such a reliable and substantial portion of the data substrate that exists. I appreciate all the time and energy you're putting into that, and I hope you enjoy the rest of your day.
[01:00:12] Mai-Lan Tomsen Bukovec:
You too, Tobias. Great to talk to you.
[01:00:21] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Mai-Lan Tomsen Bukovec and AWS
Journey into Data Management
The Evolution of S3 and Its Impact
S3 as a Shared Data Substrate
Comparing S3 with Traditional Storage Protocols
Architectural Patterns Enabled by S3
Metadata's Role in Data Management
AI Workloads and S3
Innovative Uses of S3
Challenges and Lessons from S3 Development
Future Directions for S3