From Bits to Tables: The Evolution of S3 Storage

Hello, and welcome to the Data Engineering podcast, the show about modern data management.

Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy?

DataFold's migration agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure a perfect parity between your old and new systems.

Whether you're moving from Oracle to Snowflake, migrating stored procedures to DBT, or handling complex multisystem migrations, they deliver production ready code with a guaranteed time line and fixed price.

Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafull

to book a demo and see how they turn months long migration nightmares into week long success stories.

Your host is Tobias Macy, and today I'm interviewing Andy Warfield about s three tables and s three vectors and how they fit into your overall stack. So, Andy, can you start by introducing yourself?

Hi. I'm Andy. Nice to meet you.

And do you remember how you first got started working with data?

Holy smokes. That's a really long time ago,

Tobias. I guess I,

I've worked with computers and software forever. I went to grad school in The UK, and I worked on operating systems

and networks and security and hypervisors.

And so I spent a lot of time really excited about, like, low level bits of the stack, and in particular about, like, how we could make changes there that that let us do cool new things with, with computing, like, like, virtual machines. And I guess I've kinda spent, you know, like, most of my career slowly moving up the stack, getting excited about, like, new problems. And so I went from the virtualization

stuff to doing a lot of work at, like, you know, block level storage virtualization and then working on on s three. And now I spend a lot of time at Amazon working with, not just the the s three folks, but all of the storage teams as well as some of the the teams that consume the storage stack, like, like the analytics folks and the and the AIML teams.

And so

in terms of

the s three tables and s three vectors functionality,

I imagine that everyone who's listening to this has had at least some exposure to s three given that it is largely the backbone of the Internet at this point.

But in terms of the tables and vector functionalities,

they're

a reasonable

divergence from the original objective of s three as being effectively a bag of bits where you could put things and have a good expectation that you'd be able to get them back out again. And I'm wondering if you can just start by giving an overview of what are the goals that you have

in the introduction of the tables and vectors functionality.

Yeah. This is, I mean, this is a question that's that's been on the team's mind a ton over certainly the past couple of years. I don't know that s three set out to be a bag of bits, first of all.

I I occasionally go back and, and read the very original s three PR fact, like, the the thing that that the folks wrote when when s three first launched about what it intended to be. And,

even in that doc, they they talk about managing all the world's data and about being a file system for the Internet.

And so I I do kinda think that s three has always sort of, you know, really just aspired to make data as effortless to work with as possible. I think the thing that's been really interesting about that is over the first bunch of years that I worked on s three, I think we were still learning a lot about just scaling the object workload.

And, you know, we

were growing like crazy. We were doing a ton of of doubling down on making sure that, like, we deeply, deeply

understood the data path and the durability path and stuff inside the system. We stepped into

s three Express

because

we were increasingly see seeing folks

move

to more actively using s three for things like analytics

and,

ML training data and stuff like that. And then over

over the, you know, I guess, like,

I wanna say, like, two or three year period leading up to the launch of s three tables last December,

we'd already seen loads and loads of analytics workloads, especially on top of parquet.

And the move to open table formats and especially Apache Iceberg

was really steering us into a lot of customer conversations where folks were using s three as, like, a table store for for databases.

And so I think the kind of, like, interesting,

I guess, sort of, like, slow

moment that the team had. I don't know that it was really a moment at all, but, like, we, I think, slowly came around to the idea that s three was,

was not specifically objects, and it was more a storage system that had these great properties in terms of elasticity and durability and scalability. These these things that the team kinda internally calls, like, the fundamentals.

And that we should explore

using other data types because the fact that we didn't have support for more advanced data types was a source of friction that customers were having to build on top of.

And so as you mentioned, s three tables was launched a few months ago. That has been out in the wild for long enough now for people to start investing heavily into it and build their own use cases around it. And

since we didn't really define what these things are, I'll give the brief overview of s three tables is

an iceberg native

storage medium for s three as well as now a

Iceberg

REST interface as well for the catalog implementation

and s three vectors is

a similar paradigm, but for vector storage, which has obviously become all the rage because of the recent hype around AI and semantic search and retrieval augmented generation, etcetera.

And I'm wondering

how the experience of building and launching s three tables helped to inform

the work that you've been doing with s three vectors up to its launch?

That's a great question. So it is absolutely not true that we shipped tables and then, like, decided what to do next and shipped vectors. We shipped tables about seven months ago, and and vectors took a little bit more work than that. So we were working on the vectors

project at the same time that we were working on tables. Tables was a thing where

a lot of

especially our largest analytics customers had grown reasonably sophisticated

workloads on top of on top of parquet,

and they wanted to be able to do, like, fancier things, like all of the stuff that Iceberg really provides, upserts, right, being able to do upends into existing tables more cleanly, being able to manage performance for really large tables, change schemas,

stuff like that. And so a lot of the launch there was Iceberg already existed, and people were able to use Iceberg on top of s three. And the pull that we got into launching tables as a sort of managed

table construct was that it wound up being a really common story from customers

that Iceberg was a lot of work to run. Right? That, you know, they had to schedule compaction jobs. They had to think hard about performance and sort orders and stuff like that. In a few cases,

having all of the iceberg metadata and data files exposed in the same bucket meant that they, you know, did their own work on compactors and got themselves into, like, mixed up references that that meant that they couldn't access their data and stuff like that. And so we kept having this customer conversation where folks were like, you know, I I feel like suddenly I'm managing my storage system, and isn't that your job? And so that was the thing that led us into into tables. I I think the way that the iceberg

and tables were really influential

on what we've done with vectors

is,

with these higher level constructs because they're they're all kind of built on top of objects

under the covers. The table construct

does appends

and writes to the table as a sort of copy on right thing. Right? Like, Iceberg creates Git like snapshots

with updates. If you do lots of small writes, you get lots of commits into the table. And then there's a background compaction job that tears through and reads everything and writes it back out in a in a format that recovers deletions and also makes it a lot more efficient to do reads. If you look at the sort of, like, vector workload, it has a lot of the same properties.

And so for vectors,

internally, there's a similar kind of thing of being able to do efficient appends and then having an internal compaction task that does all of the vector indexing

and puts the vectors into a format that's that's more efficient for query.

One of the interesting things with vectors

is that it is

not as well

defined as far as the workloads that people are building with vectors because there's still a lot of flux in terms of how people are building with it, which vector implementations people are building on top of. There are vector storage where it's just give me a way to store

n dimensional arrays. There are vector indexes, which provide

fast lookup and scans across that underlying storage. There are plugins to existing database engines, PG vector being the most notable.

And then there are fully dedicated vector databases that are optimized top to bottom for vector storage, indexing, and retrieval.

And, obviously, s three

as a medium for storing vectors is not an end to end database because it is also a more generalized system that has vectors as one of its capabilities. And I'm wondering how you're thinking about the role

of s three vectors in that overall context of the

vector storage indexing

discovery and retrieval

that people are building?

Well, this is, another really good question about of a stuff. You're kinda, like, reading into a lot of, a lot of interesting history. So the one of the things I love about the s three team the entire time that I've worked on the team is,

we have heated discussions about API. The team, especially folks that have been around for a long time,

kinda know that, like, the the API end of s three is is the one thing that we will live with for a really long time. And so there's a lot of a lot of effort to get that right and a lot of back and forth on things. And with vectors, as you say, it's it's still kind of emerging

the shape that that a vector API needs to take. And in particular,

the thing that you mentioned about embedding models is absolutely true. And so, you know, for, for folks that aren't as, as familiar, maybe it, it makes sense to take a step back on the vector thing. We have seen in s three and Amazon storage, generally, we work with customers with every imaginable type of data at every scale. And so it's really, really fun. It's, like, so exciting for me because I get to I get to see all of these different use cases and understand,

you know, the the challenges that folks have working with data. And vectors are a thing that, you know, I don't know, five years ago weren't really coming up in conversations. And that over that period of time, we've seen come up in all sorts of domains.

The high level workflow around

vectors, a vector is just literally like a mathematical vector. It's a multidimensional space. If you imagine, like, a two dimensional vector, it's literally like an x and a y. Those numbers are usually floating point numbers, although they're often low precision to save space. And what you do is you encode a piece of data and you place it inside that coordinate space, the vector space. And so if you imagine a two dimensional vector space, like a graph that, you know, like a x y plot that you could draw on a piece of paper, I could take a picture of a dog. I could pass it through what's called an embedding model, an ML model specifically designed to generate vectors. And that model would spit out some coordinates in that space, like 5.1, 4.2. And that location in the space would be where objects or images that were similar to that first dog image that I put in landed. And as I went and scanned in loads and loads of other images and and generated embeddings and put them into the space, a good embedding model would place similar things close together inside the vector space. And so I could go if I wanted to search for something, I could take a different dog picture, and I could,

generate its vector. I could query for nearest neighbors of that vector, and I would get a bunch of other points that let me see other

related,

pictures, hopefully, of similar looking dogs. Now none of the vector embedding models can can really, like, embed enough information in two dimensions, and so it's very common for these models to have anywhere between about 502,000

dimensions, which is kind of fascinating to me because I stopped being able to wrap my head around what a vector space looks like in about four dimensions, like really three and almost four. And so the idea of these, like, you know, 1,500 dimensional

vector spaces is kind of like a a crazy thing to think about. But then you you basically put stuff into that space. It clusters together with related things. The embeddings allow the model to basically

assign semantics

across a lot of different areas. And so, you know, that dog picture might also have a pine tree in it, or it might be in front of my old house, or there might be, like, a fire hydrant. And if I went and generated an embedding for another picture that just had a fire hydrant, that dog picture might also show up because it's close in another aspect of the of the super high dimensional vector space. And so a thing that, that we kind of, like, found with the design

was the embedding models are undergoing a ton of research right now, and they're changing all the time. And new embedding models are emerging that are, like, more and more awesome. As one example, there's a there's a startup called Twelve Labs that makes this really, really cool embedding model for video. They just launched their model on,

on Amazon Bedrock. And with with that video embedding model, you can take, like, hundreds of hours of video, and you can go punch in a query for something like a person doing push ups. And the neat thing with some of these models is that they're multimodal. And so you can type in that question as text, and it will, like, literally return you a set of offsets

of scenes because it generates vectors for every scene, that have that kind of content in them. And so it's one example of a vector index in search where you see folks in media like broadcasters

being able to take a model like that, take, like, you know, thousands and thousands of hours of archival footage,

generate scene level vectors,

and quickly be able to, you know, from a from a string search or or a captured image, build highlight reels, right, or montages

of related content in their archive. And so that that's that's kind of one example. So when we when we designed the API,

to get back to your question, we decided that the embeddings would probably have to be externalized. We didn't want to impose,

like, our choice on embedding model because we're seeing customers switch between embedding models, in some cases, build their own embedding models. But we did decide that rather than strictly storing the vectors, that it was important to offer a query capability, right, to do approximate nearest neighbor query because we were seeing virtually every customer that we talked to want to do that. They didn't wanna, like, have to build their own index file format and scan through the vectors externally and do all of that stuff. And so we decided to do this cut line around storing, you know, effectively any dimensionality of vector, indexing and doing queries on it, building that as an s three construct. And the thinking is basically that this is a building block

that really lights up vectors

as a sort of connective tissue between all sorts of applications and arbitrary datasets. And so so that's that's kind of where the API line is and and what our thinking was behind building.

Another nuance of the overall space

in terms of vectors is the question of to your point, there can be different levels of precision of the floats that are incorporated into those vectors, but then there's also the question of whether or not you support sparse or dense vectors,

namely whether or not any of the dimensions or entries within those dimensions are allowed to be null or nonexistent.

And I'm wondering

how you've

tackled with some of those details of Nuance around the types of embeddings that you want to support, particularly given the fact that it is a field that it is in motion and a constantly moving target.

Yeah. Totally. I mean, I think I think what the team's done so far is,

well, so we we we first of all spent a bunch of time talking to existing, like, Vectorstore builders. Because a lot of them actually build on top of data that's in s three because they're generating vectors on that data. And one thing that sort of we anticipated upfront with the design was that a lot of these existing vector stores have, like, really, really low latency in their design for for, like, a very high level of performance.

And I think a bunch of that is on purpose, but, also, a bunch of that is a necessity out of the way that vectors are indexed and have been indexed in the past. Typically, to be able to navigate and find nearest neighbors in these, like, ultra high dimensional spaces,

you use a an index that is like a directed graph, and you kinda walk through this directed graph to find nearest neighbors. But a consequence of walking through that graph is you're round tripping to memory. Your walk may take, you know, like, many, many hops to get through the graph to find the appropriate neighbors.

And running against s three where the latencies to storage are are potentially, you know, like, talking to hard disks, they're they're a lot higher than talking to DRAM on your local host. We couldn't use any of those types of indices. And so we kind of, like, got to a spot where we wanted to

respond to the fact that we were seeing customers that built on top of some of the existing approaches to in memory indices

running into cost problems with really large datasets where they weren't sustainable over time, especially for vector stores that were only, like, you know, accessed for bursts of activity and then quiesced. And we wanted to give those things queryability and a cost basis that was much closer to something like s three storage. But we also, out of the gates, wanted to still be able to serve those workloads that were, like, super, super performance demanding. And so we ended up talking to a lot of the existing vector stores. As as an example, we we spend a ton of time talking to the open search team inside AWS

and kind of came to a position where we we decided that we would build a thing that could, hydrate into, like, a faster provision store and serve things at at that level and then quiesce down into s three when it wasn't being used as you needed to get down to that, like, storage level cost basis. And so from a workload perspective,

supporting a lot of the embedding models that those existing stores were using

kind of dictated

a lot of our initial features, which are which are, like, like, really around vectors, vector queries, like, straight up vector queries with a little bit of filtering

and the types of vectors that we see from the most common embedding models in customers and the ability to, like, quiesce and get down to a lower cost, which we think really opens things up for for broader use of vectors.

And then

to your point of

the

various vector engines

largely feeding off of content that

resides in s three

as far as its origin and then feeding it through these embedding models,

Now that you have the ability to

natively store and understand these vector types with s three as the storage medium,

how does that influence the ways that these other vector store systems

are

considering

the ways that they interact with s three. And in particular, I know that, for instance, with streaming systems like Kafka, they usually have a means of being able to age out content to s three so you have effectively infinite durability without having to keep all of that data resident on disk. And I'm wondering if you're seeing any of the

similar patterns in these vector engines for being able to

use

s three vectors as effectively an infinite buffer of data, but only keep some of it resident on disk for the hot path?

Yeah. Totally.

I don't have, like, a, I mean, the feature launched, like, three days ago. And so,

we don't have a ton of, like, super mature integrations at this point, but we've certainly been in a beta for the better part of a year with customers and some partners around it. That pattern that you're talking about of, it's kind of like a a CDC style pattern of of aging out archival data has come up in a ton of of conversations.

It's interesting

that for,

it's it's like one of the places that I was surprised to see vectors turn up. But, like, I've seen folks using vectors to spot similarities across time series data. And so even on the straight up streaming side, we're seeing, like, startups talking about integrations where they would, you know, keep in the last whatever number of hours

and then stage out, and they would maintain, you know, sort of like a active set of vectors for the stuff that they hold and then destage those vectors out to s three for more archival queries. See the same thing on, on OLTP style database integrations with with tables wanting to maintain vector indices at the same time.

Another interesting element of this is in juxtaposition

to s three tables where you had a concrete target in the shape of Iceberg because Iceberg as a

specification and as an implementation

had a fairly mature ecosystem around it. There were lots of implementations and engines that used it, so you had a pretty good understanding of the types of patterns that you wanted to support.

And in terms of the vector space, the

only real analogous

specification and implementation that I've seen as far as using vectors on storage, yes, is the the Lance table format, which also has LanceDB

as the query engine for that. And I'm wondering

how you have been looking at the overall ecosystem,

particularly with Lance as a,

reference and thinking about

the ways that you want the s three vectors implementation

to

be a good citizen of the overall ecosystem without reinventing the wheel just for the fun of it?

We absolutely do not want to reinvent the wheel just for the fun of it. The iceberg stuff has been a really interesting sort of, like, learning experience for us in that, as you say, we've had customers building on top of Parquet on s three forever. Iceberg initially launched as a file format definition. Right? There was no iceberg API or anything like that at launch. And I think that the the iceberg community

actually kind of found that in the original version of things, they were having to maintain,

storage specific library code for

the things that they couldn't define in the file format. So Iceberg, to commit a new snapshot, needs an atomic

swap operation on the root node, as an example. And so they had, like, whatever, like, four or five different back ends for Iceberg that knew how to do that operation

and, like, knew how to interact with, like, whatever storage type was was behind Iceberg. And as that progressed,

I think, you know, as they wanted to move faster with certain features

and they wanted to be able to, like, integrate with lots of clients, I think that when learning there

was that having to push a client software update to a 100% of accessors to all of your, like, spark clusters and things like that, Whenever you made a change that impacted

either how you interacted with the data on disk or the structure of the data on disk was proving to be a real challenge. And on the Iceberg side, that gave sort of motivation to this Iceberg REST catalog API, which

suddenly is like a REST API that that allows you to talk about intention instead of directly integrating with the file format. And Iceberg right now is kind of this hybrid thing where there's the IRC, but there's also published specs for the manifests and the parquet underneath it. And so it puts us in a neat spot with customers where customers love the property of Iceberg in terms of durability, of being able to see a 100% of the of the files. Right? They can potentially go through and walk the snapshots and just back those things up if they want to. They they like

knowing the structure of the format and being able to work with it. However, they also like the rest API end of things in terms of the ease that it brings to integration

and being able to, like, move between versions and updates without having to, like, go and deploy a whole bunch of code.

On the vector side, like you say, there was the the lance format. There's a couple of other, like, I think, less mature

formats for vectors, but there was nothing

like iceberg that fully cooked through, you know, compaction,

all of the the ways that we wanted to take it. And in particular, we were pretty sure that the way that we were doing indexing

would change as we learned and especially as we got experience with with customer workloads. And so we decided on that one, on vectors, to go in the path of starting with the REST API for things and kinda seeing where stuff takes us. And so that's that's where we are on it. I think if if a format emerges that is, like, really, really universally adopted for representing vectors, I I I think we'd be, like, pretty excited about, about exploring doing something like what we did with Iceberg.

And then as far as the

technical implementation

of

s three vectors,

I'm wondering if you can talk to some of the ways that you've had to

build layers on top of that substrate of the base s three capability and some of the ways that the overall s three capabilities

factor into some of the features that the s three

vectors functionality

provides? And I'm thinking maybe in terms of the automatic life cycle management where you can either age out old vectors and delete them or cycle them into

the infrequent access mode or glacier and things like that? You're you're you're, like, ruining all my future launches. The vector layer if you wanna think about the implementation,

I'm trying to think about, like, how how much detail is useful to to go into with you on this. So a lot of what is happening in terms of the actual data structure in there

looks like a log structured merge tree, which is to say that updates arrive,

and we're we're trying to minimize the amount of of IO that we have to do to be as efficient as possible on both the right path and on the repath. And so we want to work with young data to merge it and then get as little total number of objects underneath there to serve queries.

And so the put API for vectors allows you to put a batch of up to 500 vectors. Those things accumulate and trigger a compaction task that folds them together.

And then we do a whole bunch of math to calculate the clusters of vectors and the indices and then quiesce them into,

like, more, like, large well indexed structures.

And we do a similar path on on deletions

for them. In terms of, like, the actual index structure and how the data is accessed, because it's s three and we we don't have ultra low latency

access to the storage inside the system, and we really wanted to anchor on that cost basis.

We tend to

borrow ideas from some of the,

the vector algorithms,

index algorithms that have been published for,

for that kind of storage, for higher latency storage. And and the distinction in those algorithms is they they tend to trade off

achieving fewer round trips to storage

with reading more data in parallel, which is actually a strength of s three. And so when you do a search, we kind of find the,

the broad areas of the vector store where that data is. And, we read all that stuff, and then we we walk through it in memory to get the nearest neighbor results and and return the nearest neighbors. And so that's kinda what's behind the, the query API. Now I think

this is a space where as we understand and a lot of why we launched as a preview was to really understand the usage and what the performance requirements were. And so I think that as we get a better understanding there, we're likely to make a bunch of changes to how all that's implemented

to really, like, drive into,

especially, the performance requirements of the workloads that we're seeing.

And then as far as the

index management,

there are

multiple implementations.

HNSW

or hierarchical navigable small worlds, I know, is one of the more popular ones.

And I'm wondering how you're thinking about

the functionality in s three vectors as far as being able to

evolve those implementations

as new vector index styles emerge or as new query patterns emerge and some of the ways that the interface to s three vectors needs to accommodate these

changing patterns,

the changing

style and size and format of the vectors, etcetera?

The question you're asking is the thing that I think the design actually anchors on the most. The team

intentionally chose APIs that anticipated the fact that we were probably not gonna get the index right at launch. And I don't know what the count on the number of index redesigns that we've gone through since we started working on the thing is, but it's, it's a few. Right? We've we've, like, you know, redesigned the, the index to

to reduce

over reading,

and to reduce round trips to storage, and I think we're still learning there. And so a lot of what's attractive and and this is similar to what you see with with Iceberg

in terms of compaction,

and even Iceberg in terms of, version changes on the, on the Iceberg spec, is it anticipates this idea to be able to go in and and rewrite

a lot of the underlying data during compaction and move forward without really impacting

clients. And so the and the vectors implementation, because it doesn't expose the underlying file format, is even is even kinda better on that because it's completely described by our REST API. And so for the moment, I think that we will probably do

lots of evolution

in there, like I said, as we learn.

And then one of the other fun things about vectors

is that

in isolation,

they can be useful, but you typically want to be able to reference them in conjunction

with other information that you have,

whether that is

metadata that you want to link as attributes for being able to do

filtering

prior to doing an index scan on the vectors themselves. But I think more interestingly,

epitomized by the PG vector use case, you want to be able to

use the content in your

structured data

to be able to then also reference some of those vectors.

You have s three tables. You have s three vectors. And I'm wondering how you're thinking about the potential for them to interoperate

and be able to provide some linkages of data that resides in those two different features?

Totally. So and this is absolutely where we are heading with this stuff. And so first of all, a useful thing and a lot of what really, like, kicked off the the vector stuff in the first place. But I I feel like this is like it was a point that surprised me. And I think it's, like, a really neat bit of intuition to grow about vectors and data is that, number one, like I said, we're seeing, like, an enormous number of use cases for vectors. Obviously, doing retrieval augmented generation, the rag pattern against,

against, like, LLMs or or other foundation models is,

is a pattern where you use vectors as a bridge to let the model go and search a data set and find relevant data to answer a prompt. We also

see vectors used

in all sorts of other, like,

not even directly,

AIML applications. Right? So there's there's there's really, really interesting cases of, of vectors being used in medicine,

especially in medical imaging, to quickly take an image, do, embedding and search on it, and bring up a bunch of other cases

that have images that have similarity to that image. So it's like a really fascinating diagnostic tool. We see it used in,

in genomics,

to do a very similar thing where you've got, like, some kind of disease or set of, like, like, individual

level properties, and you wanna go and find similarities

for that genome

and see whether that property is expressed in other folks. And so we're seeing it used in drug development. We're seeing it used in, like, molecular similarity also in drug development and stuff like that. The way that vectors are being applied, especially in science, is super cool, and it's fun to read about. It's been, like, a really exciting part of of working through this stuff. When you think about data in s three, s three has all of that data. Right? We have a huge set of genomics customers.

We have loads and loads of media and entertainment customers. We have customers in finance that wanna use vectors for fraud detection

and and scoring and stuff. There's there's all these neat applications.

What we did last year at re:Invent was on top of tables. We launched this feature called S3 metadata.

And what metadata does is it implements an s three table on behalf of the the customer's bucket that the service maintains.

And so it's an iceberg table. As you write data into the bucket, as you put objects in the bucket, s three metadata comes through, and,

you can configure it to produce either a journal of updates and deletions in the bucket or a table of the contents of the bucket. But because it's managed by the service, it has integrity. You can't go in and edit the contents of the table other than by modifying the objects that are in the bucket itself. And so by doing this, you get this, like like, index that you can now issue SQL queries against to pick operations in the bucket. And that's a thing that we're excited about continuing to plumb into other operations inside s three. And I think that, you know, one direction that we're we're obviously exploring is is associating

vector embeddings

with that same metadata layer. And so really providing,

developers with the ability to, like, stitch together all of these different, like, views of their data into a single metadata store that you can then, like, integrate whatever tools you want against to be able to work with that data.

So for people who are

building

with vectors, whether they have

adopted one of these

vector databases,

they're doing

something like a PG vector add on. How are you

seeing from your early adopters

the

modification

of any of the architectural

or access patterns that they might have around vectors?

And I'm wondering, in particular, given the fact that the embeddings are typically something that you have to regenerate

with some level of periodicity,

either because you're adding new data or because you decided that you need to change your embedding model or change dimensionality,

how they think about the role of s three vectors as maybe an archival store of experimentation.

There was so much in what you just asked. So, I'll I'll call on my view on a few of the things you just said. One is I've definitely had the customer conversation that matches exactly what you said around, like, I just have to regenerate embeddings every bunch of months because I'm like in some cases, I have a team that's building an embedding model, and, you know, it's it's not good enough. And I'm just refining and refining and refining to get better results off of it. I've also talked to customers who have, like, a crazy amount of data,

and running all of that data through an embedding model produces an enormous amount of new data and is quite a bit of effort and cost. On the the intuition on the on the, like, vector side is for for text data in particular,

vectors almost express a sort of, like, information density on things. It's it's kinda neat. So when you do vectors on scenes in a movie or on images, you know, like, you're

you're indexing something that's probably, like, hundreds of kilobytes to megabytes,

and you're producing a vector that's, like, like, four k. You know, it's like a it's it's an expensive index, but it's it's comparable to other indices that that you might think about building. When you're indexing text, it's not uncommon to generate vectors at, like, a 300 token boundary or paragraphs or or whatever. And now you're taking, like, literally, like, tens or hundreds of bytes

and also generating something that's, like, potentially a four k index entry. And so that ends up I mean, it's one of the reasons that that we saw customers want to not have to store all their vectors in in SSD or DRAM all the time, but it's also something that that kinda hints at the cost of generating the embeddings. And so we wanted to we wanted to be able to, like, handle both of those cases and provide something that that works for them, but that's a reality of what we're seeing. The other thing that you said that that I think is totally true is we see some customers

that are working with with reasonably mature pipelines where they they have an ingestion path, they generate embeddings, and they're putting those embeddings into a large vector database, and they're doing search all the time. We have other customers that are doing a ton of experimentation.

And a common path is, you know, there's there's a pile of embedding models and datasets on Hugging Face.

It's really fast to take that stuff. You know, last weekend, I I I was I was, like, doing load testing and and usability, kind of, like, playing around with the final vector implementation. And I I went on and grabbed a a dataset, the IMDb review datasets, like 25,000 IMDb reviews, and used, sentence transformer, which is an embedding model for Python that runs on the CPU on your laptop, generated embeddings for all those, put them into s three vectors, and worked with that. And at the smaller end of the spectrum like that, you know, 25,000

vectors,

we see a lot of desire to be able to do that and not need to stand up, you know, like, a full instance of a of a vector DB to kind of, like, learn and and experiment with vectors. And so I think, like, both ends of that are there, and I think we've tried to build a primitive that serves, like, both

the the early adopter tire kicking experimentation end, the more large scale end of things. And that kind of leaves a lot of the, like, really rich or super high performance vector functionality

as a thing that we can, like, tier up into for customers that need that.

I'm imagining that this will evolve to a point where it is

similar in terms of usage patterns to Iceberg where

you have

Iceberg or s three vectors as the bulk of your overall data because it's inexpensive

to store and convenient to access.

But then for things where you do want to have that

low latency, high interactivity,

you will

either

transfer that into a more dedicated engine or use some sort of

dynamic caching capability. I'm thinking in particular as far as something like what StarRocks offers for Iceberg to be able to query Iceberg tables, but then be able to maintain a cache of them to improve the overall latencies.

And then also to your point of reducing the barrier to experimentation or reducing

the cost

for low resource or low requirement

use cases,

I can imagine s three vectors as being a viable option for more of a serverless style pattern of, I just wanna be able to throw something up, and then people can access it however they want, but I don't wanna have to deal with managing a fleet of servers to be able to manage their experience.

Tobias, that's, like, that's pretty much exactly where the team's head is with with all of these features is

we we want to make it as easy as possible

to work with data. And,

like, anywhere that we see friction on being able to build applications

or, like, adopt tooling that works with data is a place where we really spend our energy.

And there's so many I think it's been just so kind of awesome to see as validation of a bunch

of that sort of doctrine on things is as we introduce these little features like conditionals into s three or when we remove limits or when we introduce these primitives like like tables and vectors, we see it open up

a bunch of of use cases. And, like, on tables, for example, I've had a ton of conversations with startups that are like, oh, this is actually super helpful for us because now we don't have to do the undifferentiated

work of building this, like, scalable persistence back end. We can totally focus on, you know, building a really, really high performance tier or really building, like, differentiating domain specific, you know, functionality

around the data that ends up being ultimately stored down there. And so for all of this stuff, I'm really trying to work with the team, and we're really trying to build a a sort of, like, best, you know, base foundation for working with data that, exactly like you say, lets you turn on whatever tool you need. It makes it really easy for folks to build those tools that that bring, like, domain specific value to the data.

And as you have

been building the s three vectors and s three tables functionality with your team, recognizing that vectors is still very early and likely hasn't seen a huge amount of broad adoption yet. What are some of the most interesting or innovative or unexpected ways that you have seen one or both of those capabilities applied?

On vectors, I mentioned a few, I think, with the some of the, like, medical and finance,

applications. The vectors thing is if you're interested, you know, or your your your audience is is interested in, like, just completely

nerding out in an area of, like, computer science and information theory. Go Google or use a chatbot or whatever to summarize, like, recent results in in vectors. There are so many cool ways that these things are being used. It's actually quite surprising. One example that's really interesting in there is,

is there's, like, there's kind of a fun bit of work in visualization

that can take these super high dimensional vectors

and project it down onto two dimensions, which means that you can, like, view it inside a browser.

And you get these, like, clusters,

and there are some really cool, like, visualizations that allow you to take, you know, like a pile of news articles, for example, generate embeddings, do a projection down into two d and see, like, topics

emerge

inside there. Then in some of the visualization tools, you can click through into individual articles.

There's stuff like that. There's similar examples where I've seen folks on the data science side

using vectors

and especially using

outlier detection

inside the vector space. So, like, looking for the individual vectors that don't cluster well as as things that represent poor data quality. Right? Like, they they go through and and go, like, you know, these outliers

are either, like, broken

results

or they're results where I don't have enough samples. And so even even that kind of application

is really, really interesting to,

to see.

And

as you have been building these

features,

working with the overall s three

platform and teams, what are some of the most interesting or unexpected or challenging lessons that you learned in the process of bringing these capabilities

to production?

I I don't know if I have any new lessons on this. It is, it is a lot of work to ship software, especially with the,

the expectations that that s three customers

bring to something that, is is part of s three. And so, you know, for for

for all of these things, I I always feel like we work like crazy

to to get a launch and a feature that is, like, just really, really simple. And getting to simplicity is a is a mountain of work. And this one's obviously still a preview, and it's still early, but the team's just doing, like, just a mountain of work on on all of the aspects of running,

you know, s three scale storage service that are

or storage feature that are

that are just needed to keep, you know, like, all of our durability commitments and availability and and everything else. And so I I don't think that there's, like, a specific thing in there. It's it's all of the things that are that are required to get to the kind of operational posture that we that we expect for these.

And for people who are considering

or

evaluating

an

implementation of iceberg

cataloging

and maintenance

or vector storage and retrieval, what are the cases where either s three tables or s three vectors are the wrong choice?

Oh, that's a good question. Well, I think I mean, on the vector side, we kinda talked about it a bunch. Right? On the on the vector side, I think if you if you need a so I guess there's probably two things. If if you need

super duper

low latency

or very, very high transaction rates, query rates against the vector store, SV vectors is probably not the thing for you. It's great for most workloads, I think. But but I think for for bursty workloads requiring really high performance,

you're probably a lot better off to work with a dedicated provisioned in memory store. Tables is a little bit more of a complicated

answer in that Iceberg itself, I think, is is still a little bit early. And so I think,

on tables,

there's, like, a different answer

between

the sort of, like, analytics

use cases

and what the team wants us three tables to be.

And what I mean by that is Iceberg has existed for, I don't even know what, like seven or eight years at this point. It's it's been around for ages. It's really gained a ton of popularity in the last three or so years.

It's well integrated

into big open source analytics frameworks like Spark, but also, you know, it's supported by things like,

like DAFT and all the various Iceberg

accessors and and other, like, sort of emerging run times. I think on that side of things, Iceberg is in is in pretty good shape. And if you're if you're running a large spark cluster or you're thinking about building a data lake, s three tables is a better Iceberg experience than running it on your own. It has better performance than running Iceberg on its own on s three, and I think it's it's in a really good spot. The place that we aspire to take s three tables is that it should be a programmatic

table construct in the same way that objects are. And right now, the sort of, like, API surface for that, some of the workload surface, like, loads and loads of really small updates from lots of clients,

are things that Iceberg itself is not all the way there yet. It's it's like an area that the community is is working to improve and that we're also working to improve. And so I I think that there's probably a ways to go on s three tables being like a first class programmatic primitive as a table that you just use when you're building an application, but that's where we wanna take it. And as you continue to build and iterate on the s three tables and vectors features, you've mentioned a few of the forward looking capabilities

that you're looking to implement. I'm wondering if there are any particular

projects or problem areas that you're excited to explore.

There's loads there's there's always so much more stuff to explore than we have time to, to look at. I I think I mean, Tobias, on on this stuff, you start building these these features, and, you know, it it's so fun to steer s three in these new directions. And you find yourself, like, refining and refining and refining to just get the absolute core of the thing out so that you can learn from folks and so that you're not, like, getting caught guessing wrong on shipping a bigger thing than folks can use. And then they get it. And then immediately,

you know, the customers that that, like, try it and hopefully love it, call out all of the things that you have, like, like, not built. And

usually, the first tranche of all of those things are filling

in the set of s three features, you know, which is, like, a pretty rich set of features. S three is almost 20 years old, almost 19 years old, I guess, at this point. And so, you know, it's filling in all of the stuff that you mentioned earlier, like life cycle policy and cross region replication and and all of that stuff. And so I think for both tables and vectors, there's a ton of effort that the teams will fall into in terms of making it like a first class, like, rich primitive that supports all of the ways that people use objects. And at the same time, I suspect that we'll do a lot of feature specific work, but that stuff will will will will likely roll out a little bit more slowly given the focus on just filling things out right now.

Are there any other aspects of the work that you've done on s three tables and s three vectors or the applications thereof and their overall position within the ecosystem that we didn't discuss yet that you'd like to cover before we close out the show? We talked about a lot of stuff. No. I think I I think that's about it. I'd I'd I'd love for folks to kick the tires on these, and, I'd especially love to hear the stuff that we're getting wrong.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing and give you some of that feedback you requested, It'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

Wow. That's that's a a good parting one. I don't know, Tobias. I I mean, I don't think it's a single thing. I think the reality is that,

that until we're at a point where you can just ask questions and and where it is, like, you know, like zero seconds to being able to get value out of your data, we're probably not done any of this stuff. And the reality is that that we've come a long way, especially, I think, over the past, like,

five years or so on on being able to put folks in a spot where they can just get down to working with data and building. But there's still a bunch of work that that we put on to developers that that is outside of what they wanna get done. And so I think, you know, it's it's probably a bit of a dodge,

as an answer, but it's actually the way that that that I kinda

think about the space. The solution to every problem is another layer of abstraction.

Alright. Well, thank you very much for taking the time today to join me and share the work that you and the team have been doing on s three tables and s three vectors. It's definitely great to see those functionalities

added to such a bedrock of the overall computing ecosystem. So I appreciate all of the time and energy that you and the rest of the team have put into that, and I hope you enjoy the rest of your day. You too. Thanks for having me.

Thank you for listening, and don't forget to check out our other shows.

Podcast.net

covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI

systems. Visit the site to subscribe to the show and sign up for the mailing list and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com

with your story.

Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.

Data Engineering Podcast