Summary
The rapid growth of machine learning, especially large language models, has led to a commensurate growth in the need to store and compare vectors. In this episode Louis Brandy discusses the applications for vector search capabilities both in and outside of AI, as well as the challenges of maintaining real-time indexes of vector data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- If you’re a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex’s magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you – all from natural language prompts. It’s like having an analytics co-pilot built right into where you’re already doing your work. Then, when you’re ready to share, you can use Hex’s drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan!
- Your host is Tobias Macey and today I'm interviewing Louis Brandy about building vector indexes in real-time for analytics and AI applications
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what vector search is and how it differs from other search technologies?
- What are the technical challenges related to providing vector search?
- What are the applications for vector search that merit the added complexity?
- Vector databases have been gaining a lot of attention recently with the proliferation of LLM applications. Is a dedicated database technology required to support vector indexes/vector search queries?
- What are the use cases for native vector data types that are separate from AI?
- With the increasing usage of vectors for data and AI/ML applications, who do you typically see as the owner of that problem space? (e.g. data engineers, ML engineers, data scientists, etc.)
- For teams who are investing in vector search, what are the architectural considerations that they need to be aware of?
- How does it impact the data pipeline strategies/topologies used?
- What are the complexities that need to be addressed when updating vector data in a real-time/streaming fashion?
- How does that influence the client strategies that are querying that data?
- What are the most interesting, innovative, or unexpected ways that you have seen vector search used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on vector search applications?
- When is vector search the wrong choice?
- What do you see as future potential applications for vector indexes/vector search?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. The Machine Learning Podcast helps you go from idea to production with machine learning. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Rockset
- Vector Index
- Vector Search
- Vector Space
- Euclidean Distance
- OLAP == Online Analytical Processing
- OLTP == Online Transaction Processing
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
- Hex: ![Hex Tech Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zBEUGheK.png) Hex is a collaborative workspace for data science and analytics. A single place for teams to explore, transform, and visualize data into beautiful interactive reports. Use SQL, Python, R, no-code and AI to find and share insights across your organization. Empower everyone in an organization to make an impact with data. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan!
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack.
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up to date. With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface.
Go to dataengineeringpodcast.com/materialize today to get 2 weeks free. This episode is brought to you by Hex. If you're a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex's magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you, all from natural language prompts.
It's like having an analytics copilot built right into where you're already doing your work. Then, when you're ready to share, you can use Hex's drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel, and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan. Your host is Tobias Macey, and today I'm interviewing Louis Brandy about building vector indexes in real time for analytics and AI applications. So, Louis, can you start by introducing yourself?
[00:02:32] Unknown:
Hi. So first, thanks for having me. My name is Louis Brandy. I'm the VP of engineering at Rockset. I've been here for two years and change. Prior to that, I was at Facebook for ten or eleven years, doing infrastructure work and spam fighting for many of those years, specifically infra for spam fighting. That turned out to be surprisingly useful in my current role, because we have wandered into this real-time space where AI and vectors are emerging. It's part of what we're going to talk about today, and it's what Rockset is well designed to do, so I have some of my old things coming back alongside my new things. That's the very short version. I'm happy to give you a longer version if you'd like, but I think that's enough for now.

And do you remember how you first got started working in data?

So I came into data backwards; this wasn't a thing I decided to do, exactly. It actually started with the spam-fighting work many years ago. When you're fighting spam, a lot of data problems are very real. It's super important that you have data, but the requirements are very different. For example, in spam fighting, I would much rather have one-second-old data than ten-second-old or minute-old data, but I can afford to lose some. It's okay to lose a little bit, because I'm fighting spam; if I kill 99% of it, that's good enough. Most data systems will not make that trade-off of losing a little bit of data in exchange for being a lot faster. That's not something most data systems can tolerate.
So we were building a lot of custom data infrastructure to meet those needs, and that got us into it. All of a sudden we were learning about how people do this, and what kinds of systems you build or have been built. I was mostly an infrastructure person at the time, and data and infra are very tightly related, but not entirely. I spent a lot of time doing distributed systems work, and even C++ work; I did a lot of core-libraries-type work, RPC-type things, even build systems and things like that at times. But data was very important, because the most important distributed systems at these large places are data systems, so the two are inextricably linked. That's how it got started. And when I came to Rockset, I was, and still am, more of a distributed systems person than a data person by background. We have proper database experts here; I'm allowed to say that because they sit over here and they really know their stuff. I do my best to hang with the proper database experts, but even still, I'm catching up to some degree in that world.

Yeah. Well, most of data engineering these days is just stitching together different distributed systems and hoping you know enough about how they work to actually make some sort of useful pipeline out of it.

Exactly. We're all in the same boat together to some degree.

And so in that context, I'm wondering if you can start by giving an overview of what vector search is and some of the ways that it differs from other aspects of search that people are likely familiar with.

So we could give a really long answer to this question.
I will try to give a short answer, and maybe it's worth diving into some of the details. At a very high level, vector search fundamentally happens in two stages. One is turning a very unstructured search problem ("hey, find me things I'm interested in") into a structured search. The structure here is, specifically, that you project whatever unstructured question you're asking into a vector space, and the structured question you're now asking is: given this input vector, what are the closest vectors to it? That is the search. That is what I've defined the search to be: what are the nearest neighbors to this vector in this vector space? And how you generate that vector space, how I go from something like all the restaurants in the world and some data about me to answering the question "project me into the vector space of restaurants and find the nearest restaurants to me in that space" (not physically near me, although physical distance would matter), is a gigantic problem. To a non-trivial degree, that precise problem is what is being revolutionized at the moment. The OpenAIs of the world are revolutionizing the quality with which we can embed a particular unstructured piece of data in a vector space. Twelve or so years ago, I was doing this kind of thing at Facebook. We were doing text-based vector embeddings for, for example, spam detection: show me stuff that looks sketchy based on a previous definition of sketchy. We were doing vector similarity, but our vectors weren't very good. They were good, and it was actually quite surprising how effective they were, but the vectors of today, I suspect, are much better.
Much, much better. That vector space, so to speak, is far better: the distance metric in that space is semantically valuable in a way ours weren't. And again, there's a lot of research behind this idea of semantic value, like what it even means for two things to be close in some space. So this is the core idea of vector search: the vector becomes the lingua franca of these two sides. Getting from your very hard unstructured problem into vectors is a kind of ML/AI problem. It doesn't have to be, but that's typically where we see it. Once you have vectors, we're in algorithms class. Now it's: all right, I have a billion vectors, and the distance metric in this space is, let's say, Euclidean. It doesn't have to be, but let's say it is. It's a bad interview question at this point: how are you going to find the closest one to the input vector? It's a super well-studied problem, and within it there's a bunch of other known hard technical problems, which we could, and maybe should, talk about. And so a vector database's job is to do this for you efficiently, performantly, cost-effectively, and accurately. Because the left-hand side of this equation, the generation of vectors, has become extremely good at preserving semantic value, and the corresponding problem of vector search is what extracts value from that left-hand side, this is becoming an extremely hot area. There's another intersection here, which is that Rockset is a real-time data platform whose roots are in real-time data analytics. There are different systems in the world like Rockset, but the architecture of those systems is very similar to what you want for something like a vector database.
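The structured half of the problem described here, finding the nearest neighbors to a query vector under Euclidean distance, can be sketched as a brute-force scan. This is only an illustration of the contract a vector index fulfills; real systems use approximate index structures (HNSW, IVF, and similar) precisely because scanning a billion vectors per query is too slow:

```python
import math
from typing import Dict, List


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean (L2) distance between two equal-dimension vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query: List[float],
                      corpus: Dict[str, List[float]],
                      k: int = 2) -> List[str]:
    """Brute-force k-NN: score every vector, return the k closest ids.

    O(n * d) per query, which is fine for small corpora and is exactly
    why large-scale systems need approximate indexes instead.
    """
    scored = sorted(corpus.items(), key=lambda kv: euclidean(query, kv[1]))
    return [doc_id for doc_id, _ in scored[:k]]


# Toy 2-dimensional "embeddings"; real embeddings have hundreds of dims.
corpus = {
    "doc_a": [0.1, 0.9],
    "doc_b": [0.8, 0.2],
    "doc_c": [0.15, 0.85],
}
print(nearest_neighbors([0.12, 0.88], corpus, k=2))  # ['doc_a', 'doc_c']
```

Everything a vector database adds, from index structures to durability, sits behind this same nearest-neighbor contract.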
So it turns out these two somewhat disparate areas are actually architecturally quite similar under the hood, and this is why certain kinds of databases can quite naturally add vector search as an extension to what they're already doing: they're already built to do this well. Again, we could fill hours on this if we wanted to go into the details, but this is why Rockset is getting into this space; it's a natural extension of what we're already doing. So that's an overview of the vector area. I don't know if that totally answered your question, or how much we want to double-click on anything I said.

No, that's definitely a good framing for the conversation, and I've had conversations in other episodes about vector databases and some of these challenges around generating those vector embeddings. But, yeah, to your point, it's largely the data engineering question of how you get this source data into this other representation,
[00:09:41] Unknown:
which a lot of times does require some sort of ML algorithm to build those vector representations. And then, as you said, the linear algebra is just a matter of how you want to represent that distance metric. Digging more into the technical aspect of building a vector search capacity for a particular problem space, I'm wondering if you can talk to some of the technical challenges associated with the end-to-end workflow. So, what we talked about a little bit already: generating those vector embeddings, having that vector database, having clients that are able to understand those vector representations for applying those search interfaces, and what it means to have vector search as a capability within an overall data pipeline, data platform, or data application, particularly in terms of how that compares with, say, the full-text search that people might already be familiar with?
[00:10:37] Unknown:
It's useful to ground this in an example, and text search is a good one. People are probably familiar with full-text search or keyword search: "find me all the documents that contain the following string" kinds of queries on a text corpus. Where vector search becomes valuable is that, in theory, you'd have a very smart ML model that you've trained (which is beyond our conversation), whose job is to embed your text documents in this vector space. Then, when I ask a query, I can go search for documents that are like it, and "like it" is typically framed as a semantic search: "hey, show me all the documents about launching satellites." The model has some notion of what that means and is able to find documents like that. That's the idea of semantic search. It's interesting because you mentioned technical problems, and if you're really early in this, you can imagine thinking, "oh, I can see how it could be useful to have this kind of fuzzy, semantic search instead of this very exact substring-based search." Anyone who's actually tried to do this in real life very quickly realizes you run into some very specific problems, super fast, as you try to turn something like this on. The first and most important one is that these are independently useful, and you almost always want both. You almost always want "show me documents involving satellites that contain this specific keyword," say, Sputnik. You want these hard guardrails. In the restaurant case:
"Give me restaurants within a hundred miles of me that are not expensive, that I might like." The "I might like" part is the vector-y, semantic part of the search, but all these other things are very concrete metadata that, from the product-experience perspective, are non-negotiable. If I ask you for restaurants within a hundred miles, I don't want something two hundred miles away; that's weird. Even if the vector says I would love that restaurant, it's a weird product experience.
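The guardrail idea can be illustrated with a naive "pre-filter, then rank" sketch. The field names, the price encoding, and the two-dimensional taste vectors are all invented for illustration; real engines push the filter into the index itself, because pre-filtering millions of rows in application code doesn't scale:

```python
import math
from typing import List


def euclidean(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical catalog: metadata fields plus a taste embedding per row.
restaurants = [
    {"id": "r1", "miles_away": 5,   "price": "$",   "vec": [0.9, 0.1]},
    {"id": "r2", "miles_away": 40,  "price": "$$$", "vec": [0.95, 0.05]},
    {"id": "r3", "miles_away": 250, "price": "$",   "vec": [0.99, 0.01]},
]


def filtered_vector_search(taste_vec, max_miles, max_price, k=1):
    # 1. Hard metadata guardrails first: anything outside them is out,
    #    no matter how well its vector matches.
    candidates = [
        r for r in restaurants
        if r["miles_away"] <= max_miles and len(r["price"]) <= len(max_price)
    ]
    # 2. Rank only the survivors by vector distance.
    candidates.sort(key=lambda r: euclidean(taste_vec, r["vec"]))
    return [r["id"] for r in candidates[:k]]


# r3 has the best-matching vector but is 250 miles away; r2 is too
# expensive. Only r1 survives the guardrails.
print(filtered_vector_search([1.0, 0.0], max_miles=100, max_price="$$"))
```

Note that the filtering and ranking orders matter: filtering after a top-k vector search can return fewer than k results (or none), which is part of why combining the two inside one index is the hard problem being described.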
And so joining these two kinds of searches together is actually a very hard problem. A lot of people get hung up on vector search as "okay, cool, I have this cool semantic search capability," but naked vector search is almost never great on its own. Sometimes it is very good, but most use cases need to combine these kinds of searches. And this is a really hard problem in a vector database setting, or in database settings in general. This metadata filtering problem is really important, it's very difficult to solve, and I think it's underestimated how important it will almost always be in a vector search use case. So that's the first big problem you'll run into. The second really hard technical challenge is incremental indexing of vectors. The other thing people really have to understand about vector indexes is that they're very monolithic creatures by nature, at least as of today. Meaning: I give you a million vectors, you work really, really hard to make them searchable, and if you give me one more, there's no good way to add it to the index in a way that preserves the index's fast-lookup capabilities. Obviously, I can add one, and the index won't deteriorate too quickly. But if I'm constantly streaming new vectors into this thing, it deteriorates fairly rapidly and loses its fast-lookup properties.
And so if you're in a world where you can train this thing every night, where yesterday's vectors are fine because you trained overnight, then things are good. I always use the example of Spotify: music doesn't come out that quickly, so if I train a massive vector index overnight for new music, that's going to work pretty well. But if change is happening much more rapidly, it's a very, very difficult problem to keep your vector index both fast and up to date at the same time. So those are my two very hard problems. Frankly, there are more, because you mentioned something very important: life cycle. There's actually a third hard problem here; we're too zoomed in. If you zoom out, there's a life cycle problem: okay, great, I have a billion vectors.
I got a new model now. How do I update this gigantic thing without overloading my database? The life cycle management of your vectors, I think, is another very hard kind of data management problem. These are maybe the big three hard problems that we as data people (the data-infra people, as opposed to the data users) need to take much more seriously, to do a much better job for our collective users. All three are very hard problems, and we're taking them super seriously; other people are also trying to make progress on them. I would encourage anyone interested in building in a vector search capability to ask these questions of whatever solution they're considering: what are my needs for incremental indexing? What are my needs for metadata filtering? What are my needs for vector life cycle? Ask those questions up front. You may not get good answers, but that's okay; you want those answers now rather than to not ask the questions at all.
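One common workaround for the incremental-indexing problem described above is to pair a large, periodically rebuilt main index with a small brute-force buffer of freshly streamed vectors. The sketch below is an illustrative pattern, not how any particular database implements it, and the dict-backed "main index" stands in for an expensive-to-build approximate structure:

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class HybridIndex:
    """Illustrative pattern: big rebuilt index + small fresh buffer.

    Queries fan out to both sides and merge, so newly streamed vectors
    are visible immediately without degrading the main index.
    """

    def __init__(self, rebuild_threshold=1000):
        self.main = {}    # stand-in for an expensive-to-build ANN index
        self.fresh = {}   # new vectors, searched by brute force
        self.rebuild_threshold = rebuild_threshold

    def add(self, doc_id, vec):
        self.fresh[doc_id] = vec
        if len(self.fresh) >= self.rebuild_threshold:
            self.rebuild()

    def rebuild(self):
        # In real life this is the overnight retrain/merge step.
        self.main.update(self.fresh)
        self.fresh.clear()

    def search(self, query, k=1):
        # Query both sides and merge the rankings.
        pool = {**self.main, **self.fresh}
        ranked = sorted(pool.items(), key=lambda kv: euclidean(query, kv[1]))
        return [doc_id for doc_id, _ in ranked[:k]]


idx = HybridIndex(rebuild_threshold=2)
idx.add("old_song", [0.0, 1.0])
idx.add("older_song", [1.0, 1.0])    # second add triggers a rebuild
idx.add("new_release", [0.1, 0.9])   # sits in the fresh buffer
print(idx.search([0.1, 0.9], k=1))   # new data is findable right away
```

The cost of this pattern is that the fresh buffer is a brute-force scan, so the rebuild cadence becomes a tuning knob between freshness and query latency, which is exactly the trade-off being described.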
[00:15:47] Unknown:
And there are a number of more detailed aspects of this that I really want to dig into, but I'm going to delay them for a little bit in favor of the question: what are the cases where vector search is such a high-value add that it is worth all of the added complexity we were just discussing?
[00:16:04] Unknown:
So, yeah, that's a good question. I think the historical, number-one, dead-center use case for vectors has been recommendation-type systems. The semantic text search we just discussed is a kind of recommendation engine: I give you this query, recommend things like it. If you squint at it, it's a recommendation problem. But even more classically, things like "people you may know" on Facebook, or the right pane on YouTube ("hey, these are videos you might want to click on"): those kinds of features have historically been very amenable to machine learning techniques in general, but vectors in particular.
So, you know, every user is a vector in theory, every streamer or YouTube video is a vector in some sense, and these recommendation engines being powered by vectors is a very classic use case. Amazon is another very good one: people who bought this also buy this other thing. That's a dead-center vector-type use case that, to be clear, hasn't always been done with vectors. This is another rabbit hole: you can build a bespoke classifier for all of these systems, and people often do and have, so there are different ways of approaching it, but that kind of thing is very much what vectors have historically been good for. We should point out, for completeness, that low-dimensional vectors have also been very useful for a really long time, though that's typically not what we mean by vector search. Even very low-dimensional vectors, like geo data: finding restaurants near me is a vector search problem, but in low dimension, two-dimensional vector search. There's a huge universe of stuff to talk about in that particular vein, but I think those are the two classic pieces. Obviously, there's a new one that emerged in the last year, which is the LLM use case. A lot of these chatbots, if you've ever cracked them open to see how they actually work, have a very large vector component. As a very simple example, you can imagine training a ChatGPT-like thing on your internal documents so that it can answer questions about your own internal business. You could ask, "hey, when I use this library, what happens?" and your little internal chatbot could tell you about it.
But under the hood, that same question gets turned into a vector, and a semantic search can return the internal documents that would answer your question were you to read them. That's kind of how the ChatGPT-style thing works: it says, "do this vector search, find these documents, and then summarize them for the user who asked this question." That's what the robot is doing under the hood, and it's computationally more expensive than just returning the documents to the user. But this is an emerging place where vector search is either a partial solution to the thing you wanted to build or part of the implementation of it. So people building chatbots are all looking at vector search and vector database technologies to help power those things.
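The retrieve-then-summarize loop just described (often called retrieval-augmented generation) can be sketched with stubs. Both `embed` and `summarize` here are hypothetical stand-ins: a real system would call an embedding model and an LLM API at those two points:

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Pretend document store: (embedding, text) per internal document.
DOCS = {
    "deploy_guide": ([0.9, 0.1], "Run the deploy script after tests pass."),
    "oncall_runbook": ([0.1, 0.9], "Page the on-call via the escalation tool."),
}


def embed(text):
    # Hypothetical stand-in for a real embedding model.
    return [0.85, 0.15] if "deploy" in text else [0.2, 0.8]


def summarize(question, passages):
    # Hypothetical stand-in for the LLM call that writes the final answer.
    return f"Q: {question} | Sources: {', '.join(passages)}"


def answer(question, k=1):
    # 1. Embed the question into the same space as the documents.
    qvec = embed(question)
    # 2. Vector search: fetch the k nearest documents.
    ranked = sorted(DOCS.items(), key=lambda kv: euclidean(qvec, kv[1][0]))
    top = [name for name, _ in ranked[:k]]
    # 3. Hand the retrieved passages to the model to summarize.
    return summarize(question, top)


print(answer("How do I deploy?"))  # retrieves deploy_guide, then summarizes
```

The vector search in step 2 is the piece a vector database provides; steps 1 and 3 are model calls, which is why the retrieval step is so much cheaper than the summarization around it.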
[00:19:11] Unknown:
Those are, I shouldn't say the main, but at least the most common use cases we're seeing.

Yeah. With the increasing attention on large language models and applications built on top of them, I've been hearing vector search and vector databases come up more frequently over the past couple of years, which is around when I first started hearing about vector databases to begin with. And to that point, is it a requirement, for being able to apply these vector search or vector index capabilities, to have a dedicated piece of database software focused on these vector indexes? Or is it something where you're able to take advantage of these nearest-neighbor searches without adding a dedicated piece of infrastructure for this specific problem, because it can be encompassed by technologies that might already be part of your stack?

So I think you've asked the most important question of all, because
[00:20:05] Unknown:
I have very strong opinions here, and I have many hats I can approach this question from. Obviously, I'm the VP of Engineering at a database company, so I'm gonna have opinions about this. But even in a former life, at Facebook: nobody wants to run another thing. The cost of running another thing is always super high. You could have a really good reason to run a new thing, whatever the thing. But I think there's an even more pernicious problem here, which is maybe the thing I think about the most: a vector database is much more database than vector. In real life, if you're gonna run a real vector database that's gonna solve a real production issue, most of the things you need from it, and the problems you're gonna have with it, won't have anything to do with the vectors. How is this thing managing its memory? What is its durability? How do backups work? How does access control work? How does atomicity work? What is the consistency model of this thing? These are all database problems. It's like: here, take this database textbook, get to work, and at the very end we'll implement some cool vector search algorithms. But there's no denying we have a mountain of database problems to solve first. So I'll make a very controversial claim for you: I don't think vector databases are a thing, in the sense that they are just databases. In the end, all vector databases will be databases, because it's just another index in a database, and all the database problems are still there, a mountain of them.
And so, on the question of whether it's better to add vectors to an existing database or to add a database to some vector search algorithms: I think it's much easier on average to add a vector index to an existing database than it is to build a database around a vector search algorithm. Obviously, that's gonna depend on your particular problem and what your actual requirements are, but I think most vector search use cases can get away with just having a vector index in an existing database, not building a new thing, not adding a new thing. This gets us into slightly more treacherous waters, which is that real time analytic databases are much better for vector search than, for example, an old school, and I shouldn't say old school, that's actually a terrible word, a transactional database. You still need one of those. It's not old school. You should have one of those; you almost certainly need transactions in what you're building. Those databases, though, are architecturally not quite as good for doing high performance, efficient vector search, and it's the same reason analytic, OLAP-style databases are different from OLTP-style databases. Exactly the same architectural difference is what makes analytic databases good for vector search and transactional databases not quite as good, though obviously you can still do it. As a general rule, I don't think you want your vector search going to the same database that's processing your transactions, for many reasons, but the first and obvious one is that you're gonna DOS yourself. The transactions are actually more important than everything else. It's more important that Amazon bills you properly than that it recommends smart purchases for you. And so there is this separation, right? OLTP and OLAP databases.
They were driven in those two architectural directions for a reason, and it's the same reason, to a large degree, why vector search ends up being really good in this kind of analytic database environment. So my opinion is that OLAP-style databases adding vector indexes is an extremely good way forward, and I think in the end it converges there. Vector databases will look a lot like OLAP databases, OLAP databases will add vector search capabilities, and these will almost always be paired with an OLTP database. So I'll have, say, Postgres over here and maybe Rockset over here, and that's actually the data stack for, again, an abstract problem we haven't defined, but I think that's the path forward. And my suspicion is there won't be a third category called vector database. They will just be vector-heavy OLAP databases, is how this will look over time.
[00:24:08] Unknown:
For teams that have decided that it is worth their time and effort to incorporate these vector search capabilities into whatever problem they're solving, who do you see as the typical owners of the data flows, the data modeling, and the system architecture around these vector capabilities? Is it usually the data engineer, the data scientist, the ML engineer, the data architect? I'm just wondering, what are the ways that incorporating these vector embeddings and vector search influences team topologies and how teams approach the problem?
[00:25:24] Unknown:
Okay. What a great question. Yeah, I don't know, but not in the simple sense of the phrase. I think the way this ends up looking is very similar to how some of this stuff looks today, with how you might imagine an OLAP database or an Elasticsearch cluster being run in a large enterprise. It's typically a data infra or data engineering team that owns the thing, with a bunch of other teams, data science and things like that, using it, and sometimes abusing it. Everyone knows those systems get abused; talk to any team that owns one of these systems and they'll tell you all the ways other teams are being clever and abusing it. I think vector search fits very squarely into that space in your brain. For all the same reasons you might put something like an OLAP database or Elasticsearch into your stack, I think vector search ends up fitting in a very similar place, and when you think about team topologies, that's a good analogy. So typically, it's gonna get run by a core engineering team that owns it, and many other teams will use it if it's widespread.
Obviously, the other way it happens is that the team that wants to build the feature sets up their own infra, bespoke to that one team. That kind of setup happens all the time, but it ends up being gnarly over time. It's always dangerous when a team builds a piece of infra just for the feature they want. So I don't know if that answers your question. I think this is, to some degree, still emerging rapidly, but my view is that the data infra part of an organization will own this feature. It will be part of the core data stack. It will be a primary feature of one of your core pieces of infra over time,
[00:27:18] Unknown:
and then everyone else will use it and possibly abuse it. The other interesting aspect of this is the data modeling question. In transactional databases and tabular sources, you have a very familiar syntax of columns and rows, and you know what they're supposed to mean. When you start to deal with vectors, they're a very abstract concept. It's just a mathematical structure with different values within a matrix, and unless you are very deeply invested in the problem domain and what is being represented, if you look at it, it's just a bunch of gobbledygook. You have no idea what it's actually supposed to tell you. I'm curious, what are some of the ways that you've seen teams try to apply meaning and context to these vector records, so that somebody who's looking at how they're generated, or trying to understand what the meaning is supposed to be semantically for the particular search problem they're trying to solve, can start to think about adding appropriate metadata? And in particular, as the vector representation evolves, how do you think about that migration process?
[00:28:26] Unknown:
So the way I think about vectors is as a pile of gobbledygook. I actually think that's the most healthy way to think about them. In tabular form, there is a column called my vector embedding, and it is a bunch of numbers that is basically meaningless except for the operations you can perform on it. So I actually think the tabular form is the most healthy way to do it. I have an image table, and here's the source image, here's the uploader, here's the time; you can invent your classic database schema for this. But over at the end, one of the last columns will be the vector embedding for whatever kind of model generated it. Right?
And you may have more than one, because you may have more than one model embedding it, for different definitions of semantic relevance. Right? A very simple example: the kind of model I would use in spam fighting is very different from the one I would use, say, in image search. Show me images of elephants: there I want a very specific kind of model that understands elephants. But in spam fighting, I actually care more about near duplication. I don't care about images of elephants; I want things that are nearly identical. I want it to be resistant to rotation or cropping or white balance changes, things like that. That's a very different notion of closeness when we're talking about semantics.
So you may have more than one column that is a vector. To be concrete: you and I know it's a bunch of numbers, so you might think, I should have a bunch of columns that represent a vector. But I would say no, because there's no relational value; or I shouldn't say that, I think it's very healthy to not assume there's any relational value within a vector. You'd never wanna write a query that's, like, show me all the vectors where the 15th dimension is above 42, or where the 23rd number is less than the 47th number. That's not a thing that I know of that is ever gonna be valuable. So as a general rule, I think of a vector as just a new column in your database that you have built an index on, just like you would any other column, except it's a very specific kind of index with some peculiar properties; for example, it's very hard to incrementally update. This model also highlights the whole metadata filtering problem, that very hard problem we talked about, and it makes it very natural to wanna talk in these terms. Right? When I say, show me all the restaurants like this one, what I'm saying is: show me all the restaurants, order by distance in this vector column, limit 10, where, you know, it's Italian. That's my SQL. So I think that's the right way to model this, and it fits naturally into a SQL-type mindset. It reduces to language we understand: your where clause is the metadata filter, your order by is typically distance, and that's essentially your vector search problem. That tends to be how I think about it.
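That mental model, a where clause for metadata, an order-by on vector distance, and a limit, can be sketched in a few lines. This is a brute-force illustration with made-up restaurant rows and tiny 3-dimensional vectors; a real system would use an approximate nearest-neighbor index rather than a full scan.

```python
import math

# A toy "restaurants" table: ordinary metadata columns plus one
# embedding column. Rows and vectors are invented for illustration.
restaurants = [
    {"name": "Trattoria Roma", "cuisine": "italian",  "embedding": [0.9, 0.1, 0.0]},
    {"name": "Pasta Palace",   "cuisine": "italian",  "embedding": [0.8, 0.3, 0.1]},
    {"name": "Sushi Corner",   "cuisine": "japanese", "embedding": [0.1, 0.9, 0.2]},
]

def cosine_distance(a, b):
    """1 minus cosine similarity: smaller means more semantically similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def vector_search(rows, query_vec, where, limit):
    """WHERE <metadata filter> ORDER BY distance(embedding, query) LIMIT n."""
    filtered = [r for r in rows if where(r)]  # the metadata filter
    filtered.sort(key=lambda r: cosine_distance(r["embedding"], query_vec))
    return filtered[:limit]

query = [1.0, 0.2, 0.0]  # embedding of "a restaurant like this one"
top = vector_search(restaurants, query, lambda r: r["cuisine"] == "italian", 2)
print([r["name"] for r in top])  # ['Trattoria Roma', 'Pasta Palace']
```

The interesting engineering lives in doing the filter and the distance ordering together efficiently at scale, which is exactly the hard metadata filtering problem mentioned earlier.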
Now, there is a much harder problem of understanding these vectors. That's an open research problem: what does this even mean? What is the semantic meaning of the 15th dimension? The fact that you're a 45 and I'm a 13 in dimension 15, is there any meaning to that? As far as I know, this is an extremely unsolved problem. It may be unsolvable.
And so, from a vector database and data user's perspective, I would think of it as gobbledygook. I think that's actually the right way to think about it. It is a bunch of numbers that you can perform functions on that give you semantic meaning. In terms of
[00:32:06] Unknown:
the consumption side of things: for relational databases and tabular references, there's a large array of software, libraries, and technologies that understand how to work with that data. I'm wondering what you are seeing as the current state of the ecosystem for clients that are dealing with these vector representations, for being able to understand how to structure the queries, how to interpret the results that come back, and some of the ways that factors into the overall system design of the data applications interfacing with these vector search capabilities.
[00:32:40] Unknown:
Okay, you've asked a good question. You made me think a little bit here. Traditional databases have this advantage that we talk to them in a relatively universal way and they give us an answer in a relatively universal way. So you could build tools on top: there are a lot of great visualization libraries that let me explore a MySQL database in a relatively straightforward way. What is the equivalent for that in the vector world? The answer, I think, is clearly that there is none at the moment. And that's in part because a lot of our first principles aren't even the same; you don't talk to vector search in a universal way at all today. We are building it like SQL. So for us, it's about fitting into this existing model: if you have a tool that knows how to interrogate via SQL and visualize something in an interesting way, I wanna make that work for us in that paradigm. But some of these tools don't even use SQL; it's an API call.
There is no generic sense in which you can interrogate one of these things to get answers, and then, certainly, it will return you answers in a very different way. In a simple sense it's gonna return you vectors, but typically there's metadata associated with those vectors, so either you have data there or you might wanna do something like a join, or the equivalent of a join, on the answers that come back. And so now everything becomes very bespoke. We are trying really hard to make this a conventional player in the ecosystem: if you think of vectors as simply data that's in your database, that you can join with other tables, that you can perform searches on, and that you have indexes on, it will play nicely with everything else that has been built above and around it. But that is by no means the emerging standard at the moment; I don't think there's a standard at all yet. So the correct answer to how you're gonna do this kind of stuff depends exactly and precisely on the tool you happen to choose today, and it will be easier or harder depending on the specifics, because we haven't standardized this at all. We don't have a lingua franca entirely established here, and so things are messy at the moment.
[00:34:50] Unknown:
And in your work of building a database technology that is aimed at supporting this vector search capability, what are some of the aspects of customer education, and the logistical challenges, that you're seeing teams tackle in the process of adopting these capabilities, integrating them into their data stack, and adding them as a feature to their existing suite of data platforms and data technologies?
[00:35:20] Unknown:
Yeah, so, logistical challenges. There's a couple of ways I've interpreted the question. The first thing that comes to mind is that one of the things we're seeing a lot of is businesses and teams that know something is happening but don't understand: does this apply to us? Are my competitors about to do something, and I don't even know what's happening, and it'll destroy everything? They're trying to figure out: is this just something those crazy people over there are doing, or does this apply to what we're trying to do? So finding a path to: in our particular business, with the problems we actually face, is there a place for this rapidly emerging technology? That's one area of customer education that is hard. We're talking to lots of customers, and there's a lot of brainstorming happening about how you might use some of this stuff in a particular business venue. It's not always clear to me that there is a good use case, and that's okay too, obviously. The other part that's interesting is that we're pretending it's two people talking, but when two organizations talk, it's actually a lot of different people talking. So you might have the team that owns and operates the infrastructure versus the team that wants to use it, and they have different needs and wants and hopes and dreams. We covered some of this earlier, right? The product teams might think this is useful, but the infra teams are like, okay, but I don't wanna run another thing unless you tell me this is really gonna be useful. So this part is also quite hard: typically the team that has to own this thing and run it and pay for it, so to speak, isn't necessarily the one who wants it.
It's not a thing that's gonna power something important to them.
So understanding what the real business value is here, and getting a business to understand that value, is also very important, because it's complex, it's emerging, and it's not obvious how this fits into an organization. And then there's another meaning of the word logistical which I kind of latched on to, which is that the sheer problem of managing this is large. If I have a million vectors, that's probably not too bad. If I have ten million, or a billion, that's a lot of data, and I need to update it and keep track of it. Imagine I have to run a really expensive model, a new version of it, on a billion vectors. That's a huge compute job that needs to get orchestrated, a massive piece of infrastructure. And typically your vector database doesn't solve this for you, meaning you don't wanna run a massive retraining job on your database, because that's gonna DOS whatever it's running in real life.
So you have to come up with something. There's a massive logistical problem here of just: how do I retrain this thing? How do I update it? How do I deal with rerunning my model on a billion vectors, storing the results, and keeping my site up while I do it?
[00:38:18] Unknown:
And in your experience of working in this space, talking to customers, and applying these technological capabilities to the problem spaces where they make sense, what are the most interesting or innovative or unexpected ways that you've seen these vector search capabilities applied?
[00:38:33] Unknown:
So my favorite use case is a traditional use case, but it incorporates a lot of things that are subtly hard, and it's one of those good examples of small details mattering a lot. They take the problem from easy, and I'm using scare quotes on easy, because even the easy form of the problem is not that easy, to very, very hard, very quickly, once you use certain words. We have a particular customer called Whatnot. Whatnot does live buying and selling; it is essentially a union of eBay and Twitch. The idea being, at any given moment, you can go to Whatnot, and if you're interested in buying Pokemon cards, there's someone live trying to sell Pokemon cards, showing them to you. You can interact with them, you can ask questions, you can buy stuff from them. It's a live marketplace.
They have a classic vector search problem: hey, I historically am into a, b, and c; can you show me people that are selling a's, b's, and c's? That is a dead center vector search problem, but it has this very difficult real time, online component. Meaning: the user just decided that today they're gonna sell Pokemon cards. Yesterday they were selling baseball cards, and so their vector has changed. It's suddenly a new vector, right? It used to be baseball, now it's Pokemon. And similarly, they're online right now. That's a piece of metadata associated with right this second, as opposed to five minutes ago or an hour ago. So there's a very real time component to their vector search. Remember, all the way at the beginning I said incremental indexing is hard, it's not a thing you wanna do. Well, congratulations.
Because they have this live component, it's incredibly important that their vector index is up to the minute, or ideally fresher, and that their metadata is fresh. Right? It doesn't do me a lot of good to say, hey, here's a streamer you would really love to watch, but they went offline 2 minutes ago. So I have to have real time metadata filtering, and I have to have effectively real time indexing. Those two requirements radically alter the question of vector search. It's much more technically treacherous territory now, because it doesn't really do them any good to train some giant model overnight when everything could be different tomorrow. I could log in at any moment and sell whatever I want, so I really need this thing to be nimble.
And vector search indexes are many things, but nimbleness is not something that comes naturally to them. Now, these kinds of real time systems, well, Rockset comes from a real time analytics world, so this is what we care about already; it doesn't matter if you have vectors or not. The idea that your data needs to be up to date when you search it, as fast as possible, is a core part of our infrastructure. But that's when you start to get into the really hard problems of vector search that become very valuable from a product experience perspective, and so that's one of my favorite go-to examples of where you really are pushing the technology.
For example, you could not just download a random vector search algorithm from the internet. You can find really good ones on GitHub right now, but they are not gonna be good for incremental indexing, especially continuous incremental indexing. That's gonna break stuff. So we're into proper science territory now, where people will write papers, and PhDs will be given to people who make this particular part of vector search better.
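To make the freshness requirement concrete, here's a toy sketch (seller names, categories, and 2-dimensional vectors are all invented). A brute-force scan can absorb an update instantly, which is exactly what the graph- and cluster-based approximate indexes you'd download tend to struggle with, since they usually need a costly rebuild to stay accurate.

```python
# Each seller has a live embedding plus live metadata (online or not).
sellers = {
    "alice": {"embedding": [1.0, 0.0], "online": False},
    "bob":   {"embedding": [0.0, 1.0], "online": True},
}

def dist2(a, b):
    """Squared Euclidean distance; smaller means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(query_vec, limit=1):
    """Freshness-aware search: filter on live metadata, rank by distance."""
    live = [(name, s) for name, s in sellers.items() if s["online"]]
    live.sort(key=lambda item: dist2(item[1]["embedding"], query_vec))
    return [name for name, _ in live[:limit]]

pokemon_query = [1.0, 0.0]      # "show me Pokemon card sellers"
print(search(pokemon_query))    # alice is closest but offline -> ['bob']

# alice logs in and switches categories: update the vector and metadata
# in place, and the very next query reflects it, with no batch re-index.
sellers["alice"] = {"embedding": [0.9, 0.1], "online": True}
print(search(pokemon_query))    # -> ['alice']
```

Brute force only works at toy scale, of course; doing this same instant-update behavior inside a real approximate index is the research-grade problem being described.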
[00:42:12] Unknown:
And in your own experience of working in this space and keeping abreast of the changing technologies and the changing ways they're used, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:43:09] Unknown:
Yeah, at the risk of repeating myself: this incremental indexing problem is A-plus hard. The metadata filtering problem is A-plus hard. One of the other challenges, and I said a version of this earlier, but I will repeat it here: there is a grave mistake that anyone out there could make, and I encourage you not to make it, which is to not understand the difference between vector search and a vector database. It's not that hard to download a vector search library from the Internet and start playing around, and you could probably build a really compelling prototype super fast. But you should take the next step very carefully. The next step being, oh, we should launch this, because what you're gonna do by accident is sign yourself up for building your own database, slowly but surely. Very quickly, there'll be questions about, okay, what's the consistency of this thing, and what's the durability of writes to this thing, and what's the data latency of this thing, etcetera, etcetera. Access control. Right? I'm not telling you not to build your own database. I mean, I am telling you: don't build your own database. But if you decide to build your own database, at a minimum, you should do it consciously. Get in a room and say, alright, we're gonna build our own database. Don't just accidentally wake up one day and go, uh-oh, we're building our own database; I have a whole team of people worried about transactional consistency.
That seems like we've gone wrong somewhere along the way. So I think that's my main takeaway here. This is more of my warning to the people that are starting to go down this road: there are a few early forks in the road that I want you to understand. You can take the different paths; just take them eyes wide open, instead of the other way around.
[00:44:52] Unknown:
And for people who are exploring these capabilities, trying to get a grasp on vector search, how it works, and how to apply it to their problem, what are the cases where it's just absolutely the wrong choice?
[00:45:04] Unknown:
Yeah, when is vector search the wrong choice? It is all fundamentally built on the idea that there exists a vector space on which a metric preserves some semantic value, and that you have a way of embedding your things in that space. There are absolutely cases where it's not clear that a vector space with a distance metric can model the question you've asked, and therefore this is a terrible setup. A lot of these are trivial in some sense. For example, keyword search is not a thing you should do with vectors, because it does not map to a vector space particularly well. In fact, there's a good example of this that's come up recently, which is fuzzy string matching. Imagine you want to do keyword matching but with typos, because humans typo; like a type-ahead. A type-ahead is a very common thing people wanna build, right? I start to type, and I want it to give me some autocomplete or some suggestions, and I want that to be typo aware, so if the user makes one or two typos, it should still suggest things that might be valuable completions. This problem, as far as I know, is not amenable to vectors at all. There's no vector space where distance is like edit distance, for example. So this doesn't work in vectors, as far as I know. There are a lot of spaces like this, where the modeling you wanna do doesn't work in a vector space with a distance metric, and for all of those use cases, you should avoid vectors. The other thing I would say is that vectors are relatively expensive compared with traditional means. Meaning, my database record historically has maybe 2, 3, 400 bytes in it, depending on what I'm putting in there.
But an OpenAI text embedding is like 1500 dimensions, so it's several thousand bytes at 4 bytes per dimension.
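As a quick sanity check on that arithmetic (assuming 4-byte float32 values; 1536 is the actual dimensionality of OpenAI's text-embedding-ada-002, matching the rough "1500" here, and the 300-byte "traditional" row is just the illustrative figure from above):

```python
# Back-of-envelope storage cost of an embedding column.
def embedding_bytes(dimensions, bytes_per_value=4):
    return dimensions * bytes_per_value

per_vector = embedding_bytes(1536)
print(per_vector)                     # 6144 bytes, roughly 6 KB per row

traditional_row = 300                 # "2, 3, 400 bytes", as above
print(per_vector // traditional_row)  # the vector alone is ~20x the row

rows = 100_000_000                    # a hypothetical fleet of 100M rows
print(per_vector * rows / 10**9)      # ~614 GB of raw vector data
```

So the embedding column easily dwarfs the rest of the record, which is where the cost argument that follows comes from.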
So any normal query you would run is much more expensive when it's doing a bunch of vector stuff than it would be otherwise. There are probably use cases where, oh, this might even be better with vectors, but it's 5 or 10 times more expensive than a trivial SQL-type query. And so maybe in that case, is the ROI worth it? There are probably many cases where it wouldn't be worth it at all; it's computationally a bad ROI. So I think that's another thing to consider when you're thinking about vectors.
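For the fuzzy-matching case mentioned above, it helps to see what typo-aware closeness actually is: a dynamic program over two strings, not a geometric distance in any known embedding space. A minimal sketch of classic Levenshtein edit distance:

```python
# Classic dynamic-programming edit distance. Useful for typo-tolerant
# matching, but, as noted above, there is no known fixed vector embedding
# whose geometric distance reproduces it, so it stays outside the
# vector search toolbox.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance("vector", "vectro"))  # a swapped pair costs 2 edits
print(edit_distance("kitten", "sitting"))
```

Type-ahead systems typically attack this with tries, n-grams, or specialized automata rather than nearest-neighbor search.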
[00:47:22] Unknown:
As the overall adoption of vector search and vector indexing continues to increase, particularly with the increased application of large language models, what are some of the future potential applications that you see for these vector indexing and vector search capabilities that are yet to be realized?
[00:47:43] Unknown:
I think the correct answer to this is that it's very hard to predict. And the reason I say that is because vector search is exactly as valuable to my business as the value of the embeddings we are searching. The fact that we suddenly have a bunch of really high quality text embeddings suddenly makes something like vector search for, say, chatbots, or a natural language prompt as a search query, a really attractive use case, because the embeddings are so good. So if an equivalent embedding technology comes out for some other medium, suddenly that's gonna be a super valuable use case tomorrow, almost as a function of that. In some ways, the limiting reagent in the future use cases is the quality of the embedding and how well it preserves semantic meaning, not the vector search piece. I will say, though, that in kind of the AI language, they talk about multimodal.
And I kind of want to talk about non modal, meaning it's one thing to look at an image and come up with an embedding, then look at some text and come up with an embedding, then look at a video and come up with an embedding. But a lot of the problems we are seeing have no single modality; virtually all the data in the universe is the input to this thing. A trivial example: I want to do fraud detection for a payment processor. What is the data I might pass in to ask, is this transaction fraud? Well, it's all the point-in-time metadata associated with that transaction: they want to charge this much money, this is the vendor, and so on. But there's also everything that's ever happened. What is the entire history of this vendor? What is the entire history of this region? What is the entire history of this payee? Turn all of that into a vector, put it into a vector space, and tell me if this transaction looks shady, for example by checking whether it's near previously known shady vectors.
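The shape of that idea, turning one transaction plus historical rollups into a single candidate vector, can be sketched as below. Every field name and helper here is an illustrative assumption, not a real fraud-detection API; as the discussion goes on to note, making such a vector actually *good* is the unsolved part.

```python
# Hypothetical sketch: concatenate point-in-time transaction fields
# with rollups of "everything that has ever happened" for the vendor,
# region, and payee. All names are illustrative assumptions.
def transaction_features(txn, history):
    """Build a flat feature vector from one transaction plus history."""
    point_in_time = [txn["amount"], txn["hour_of_day"]]
    rollups = [
        history["vendor_txn_count"],
        history["vendor_fraud_rate"],
        history["region_avg_amount"],
        history["payee_chargeback_rate"],
    ]
    return point_in_time + rollups  # the candidate vector

vec = transaction_features(
    {"amount": 42.0, "hour_of_day": 3},
    {"vendor_txn_count": 120, "vendor_fraud_rate": 0.01,
     "region_avg_amount": 55.0, "payee_chargeback_rate": 0.0},
)
# A nearest-neighbor search against known-shady vectors would then run
# over vectors like this one.
```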
That's the kind of thing where, again, the vector search problem is very straightforward, but I don't know how to do that first part: how to take that particular transaction and everything that's ever happened, the entire history of, I guess, everything before that transaction, and turn it into a vector in a meaningful sense. I mean, I can turn it into a vector, but is it any good? How do I make that vector, and that vector space, better? That's a place where I would like us to make more progress. I would love to be able to help users who have problems like that vectorize them. These are the kinds of questions we're getting asked that I don't necessarily have great answers for today.
[00:50:21] Unknown:
And that actually brings up another question that we didn't touch on yet, which is data quality in the context of vectors. If you have tabular data with numeric or textual values, you can say, this is the expected range or distribution I should have for this column, or these values are outside of the usual set. So I can do some sort of anomaly detection on those static or scalar values within the tabular construct. Even if you're dealing with unstructured data like a JSON object, you can do some sort of anomaly detection there. I'm curious how that concept of data quality management applies to vectors, and whether that is something that is still yet to be determined.
[00:51:02] Unknown:
So when you say data quality, typically what I think about is rogue data: data that is truly nonsensical, not merely anomalous. Meaning somebody typoed something or some bug has occurred. And it's interesting because a vector space is a dense thing; by definition, all values are valid. So in a sense it's very hard to tell that something is truly wrong in the data. On the flip side, it's easy to tell that something's anomalous: you have literally built a space that is intended to cluster things, so if something is not close to anything else in the space, it's anomalous by definition. That might be because the original thing is fundamentally unique in an interesting way, or it might mean something terrible has gone wrong along the way, meaning there's a bug or some other quality issue has occurred.
I don't know of any interesting work in this particular dimension at the moment; it's actually an interesting thing to think about. I am very interested in the problem of anomalous vectors. Imagine you're ingesting vectors all the time, and I want to do what we would call ingest transformations: when you ingest a row with all its metadata and the vector, I want to add a new column that says, is this anomalous, yes or no, parameterized in some way. So every time a new vector comes in, I look at it and go, oh, this is really far from everything else, so let's flag it as anomalous. That's a very powerful little feature. Typically I think of this in terms of the user's use case. Again, I'll use fraud transactions: you're sending transactions to me, we build a vector embedding of them, and then I can flag anomalous ones based on some criterion or metric. That's very interesting to me as a feature of these vector spaces, materializing the notion of distance at time of ingestion. I think it's a very powerful potential feature, and I do think these kinds of ingestion-time transformations will become relatively standard. But in terms of quality, I actually don't know. That's a great one. I want to write that down, because I want to go see if there's anything here that maybe we should be thinking about.
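A minimal sketch of that ingest-time "is anomalous" column might look like the following. The brute-force distance scan and the fixed threshold are simplifying assumptions for clarity; a real system would use an approximate-nearest-neighbor index and a tuned or learned criterion.

```python
# Sketch of an ingest transformation that materializes an
# "is_anomalous" flag: a vector is flagged when its nearest stored
# neighbor is farther than a threshold. Brute force for clarity only.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ingest(store, vector, threshold=5.0):
    """Append the vector plus a computed is_anomalous column."""
    is_anomalous = bool(store) and all(
        euclidean(vector, existing["vector"]) > threshold
        for existing in store
    )
    store.append({"vector": vector, "is_anomalous": is_anomalous})
    return is_anomalous

store = []
ingest(store, [0.0, 0.0])      # first vector: nothing to compare against
ingest(store, [0.1, 0.2])      # close to the first, so not anomalous
ingest(store, [100.0, 100.0])  # far from everything, so flagged
```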
[00:53:15] Unknown:
Yeah. And even just in terms of simple schema enforcement: all of these other vectors have this number of dimensions, while this other vector has n plus 1 or n minus 1 dimensions, so I'm actually not going to allow you to insert it because it's invalid for this context. So this is the one thing we do. I mean, this is it. Rockset has arrays as a type, and
[00:53:39] Unknown:
by our definition, arrays can be arbitrary lengths. So in theory an array is basically a list: you could have an array of size 4, and then the next row can have an array of size 7, and so forth. You're allowed to model it that way. For us, though, a vector is a fixed-size array for a column. So we say, okay, this column has got to be a vector of, say, 256 elements, and we will check that. But that's about the end of it, because once you have at least that, everything is valid: if the column is 256 integers, any list of 256 integers is valid. There isn't really sanitization you can do beyond that. I don't think there's any reasonable sense in which you can say, well, in dimension 16 the high-order bits should be 0; none of that makes sense. In some sense, vector embeddings are trying to maximize their use of that space, so in theory everything within that space should be valid.
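The fixed-dimension check described here reduces to a single length comparison at insert time. The sketch below is illustrative; the class and method names are assumptions, not Rockset's actual API.

```python
# Sketch of fixed-dimension enforcement for a vector column: the
# column declares its size once, and inserts with any other length
# are rejected. Names are illustrative, not a real database API.
class VectorColumn:
    def __init__(self, dimensions):
        self.dimensions = dimensions
        self.rows = []

    def insert(self, vector):
        if len(vector) != self.dimensions:
            raise ValueError(
                f"expected {self.dimensions} dimensions, got {len(vector)}"
            )
        # Beyond the length check, any values are accepted: a dense
        # embedding space treats every point as potentially valid.
        self.rows.append(list(vector))

col = VectorColumn(4)
col.insert([0.1, 0.2, 0.3, 0.4])   # accepted
try:
    col.insert([0.1, 0.2, 0.3])    # wrong length, rejected
except ValueError:
    pass
```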
[00:54:35] Unknown:
Are there any other aspects of vector search, the ways that you're applying it at Rockset, or the challenges associated with building capabilities on top of it that we didn't discuss yet that you would like to cover before we close out the show? I guess I would just reiterate the point that I'm a huge believer
[00:54:52] Unknown:
in both the incremental and real-time components of vector search as well as the metadata components, and that this all needs to be built together and unified. Meaning, I don't want to build a vector search that has a metadata filter bolted onto it weirdly. I want a system where you can write a query that involves metadata filters and vector filters, and they all participate together in one giant query with different indexes, and I have an optimizer that knows how to choose between them. If the left-hand side is really selective, it should do that part first; if the right-hand side is really selective, it should do that first, whether the vector is on the right or the left. It should plan the whole thing with all of the indexes participating together, and I think that is the best path forward: unified systems where the distinction between vector search and metadata filtering is synthetic, in the same way that using different kinds of indexes within your conventional database queries is an implementation detail. It's not a thing users should be worrying about. So I'm a huge believer in that path forward. And for us in particular, the real-time component is super important. I want to be able to ingest vectors and have them available quickly. I want the searches to be fast. I want any metadata changes to be updated quickly. I think these are the core components of what we're building and how we're thinking about the path forward here. So, yeah, I'll just reiterate that; I think we've talked about it.
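The selectivity-driven planning idea can be sketched as a toy planner. This is an illustrative assumption about how such an optimizer might choose, not how Rockset's query planner actually works: estimate the metadata predicate's selectivity, then either pre-filter (metadata index first, ranking only the survivors) or post-filter (rank by vector distance first, then apply the predicate). The results are the same either way, which is exactly the sense in which the choice is an implementation detail.

```python
# Toy hybrid-query planner: pick metadata-first or vector-first
# based on estimated predicate selectivity. Brute-force distance
# stands in for a real vector index; the cutoff is an assumption.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hybrid_query(rows, predicate, query_vec, k=2, prefilter_cutoff=0.1):
    selectivity = sum(predicate(r) for r in rows) / len(rows)
    if selectivity <= prefilter_cutoff:
        # Highly selective metadata side: filter first, then rank.
        candidates = [r for r in rows if predicate(r)]
    else:
        # Vector side first: rank everything, then filter the top hits.
        candidates = rows
    ranked = sorted(candidates, key=lambda r: euclidean(r["vec"], query_vec))
    return [r for r in ranked if predicate(r)][:k]

rows = [
    {"id": 1, "vec": [0.0, 0.0], "category": "a"},
    {"id": 2, "vec": [1.0, 1.0], "category": "b"},
    {"id": 3, "vec": [0.1, 0.1], "category": "a"},
]
hits = hybrid_query(rows, lambda r: r["category"] == "a", [0.0, 0.0], k=1)
```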
[00:56:22] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think that real-time analytics
[00:56:39] Unknown:
is behind. We, as an industry, are behind in a way that has caused enormous problems for people, and a huge number of systems are abused because we don't have good real-time systems available. This is the problem we were trying to solve. I don't want to pick on it, but if you go to any team that owns a nontrivial Elasticsearch cluster, they can tell you six different ways their Elasticsearch cluster is being abused, often for real-time analytics. So I think that particular piece of our industry is way behind, and we're trying to help catch up here. The other one I will give you is in an ML context: I think we data people have deeply failed to build first-class citizens for ML concepts. I can give you rows and tables and so on, but an ML person wants features. They want features with a specific computation that generates them. They want to track them over time, version control them, A/B test them, time travel them. All of this can be built with enough work on your existing data platforms, but the gap is really wide, and the result is very unusable. So I think data people have really failed to build usable primitives for ML and AI, and we have a lot of work to do here. And again, the vector lifecycle stuff we were talking about today is just another example where we have a lot of work to do on top of what we've already built to make good primitives in an ML and AI landscape. So those are two big ones: one in general and one for AI and ML.
[00:58:06] Unknown:
Absolutely. Well, thank you very much for taking the time today to join me and share your thoughts and expertise on this problem of vector search, how to think about it, how to apply it. It's definitely a very interesting and valuable problem area, so I appreciate all the time that you and your team are putting into making it a more tractable problem. So thank you again, and I hope you enjoy the rest of your day. Thank you for having me, and I hope you enjoy the rest of your day as well.
[00:58:35] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Overview of Vector Search
Technical Challenges in Vector Search
Use Cases for Vector Search
Vector Search in Existing Databases
Team Ownership and Data Modeling
Client Ecosystem and Integration
Customer Education and Logistical Challenges
Interesting Applications of Vector Search
Lessons Learned and Future Applications
Data Quality in Vector Search
Closing Thoughts and Future Directions