Optimize Your Machine Learning Development And Serving With The Open Source Vector Database Milvus

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode.

With their new managed database service, you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.

Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack, observing data and ensuring it's reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels, all thanks to over 50 quality checks, extensive column-level lineage, and over 20 connectors across the data stack. In addition, data discovery is made easy through Sifflet's information-rich data catalog with a powerful search engine and real-time health status. Listeners of the podcast will get $2,000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2 week free trial. Find out more at dataengineeringpodcast.com/sifflet today. That's S-I-F-F-L-E-T.

Your host is Tobias Macey, and today I'm interviewing Frank Liu about the open source vector database Milvus and how it simplifies the work of supporting ML teams. So, Frank, can you start by introducing yourself?

Thanks for having me on the show today, Tobias. My name is Frank. I'm currently a director of operations at a startup called Zilliz. We do vector databases. I've been sort of in and around ML for quite a while. After graduating, I worked on computer vision and machine learning on sort of a hybrid research and software development team over at Yahoo. I had my own company for a while, where we were doing machine learning for IoT platforms. And now I am at Zilliz, where we do sort of ML infrastructure.

And do you remember how you first got started working in data?

That's a great question. I wanna say, back 6 or 7 years ago when I was at Yahoo, that's when I really was able to see what big data means, and they had quite a large amount over there. Big data back then was more of a catchall term. Some folks considered a dataset with maybe a million examples or a million samples to be big data. For other folks, it might have been 10 million. But at Yahoo, we had billions of images, everything you can think of. A big challenge for the team that I was on in particular was that we needed to figure out how all of those fit into these different models, how all those pieces of data could be used to improve a lot of business functions. And I think a lot of that really spills over into what we're doing today with vector databases as well.

So in terms of that, can you describe a bit about what the Milvus project is and some of the story behind how it came to be and why you decided that this was the area that you wanted to spend your time and energy?

Milvus, I wanna say, was created at Zilliz in about 2018, and it was open sourced, I think, in October of 2018 as well. The idea behind it is that we had so many users and potential customers come to us and say, look, we realize you're doing a data storage platform, but we need something to be able to store, index, and search across massive quantities of data. Think embedding data, so vector data. One of our very first users came to us and said, we have 50 million of these vectors. This was 4 years ago, so they already had quite a bit of data that they wanted to search and index. This is what really prompted us to get started on Milvus, to get started on this vector database. We did end up open sourcing it later. Milvus is actually now part of LF AI & Data, the Linux Foundation's AI and data foundation.

I'm also gonna talk a little bit about what a vector database is and how it works, since I imagine that's a follow-up to this question. The idea behind a vector database is that it's meant to store what we call unstructured data. There are so many really popular SQL databases out there, MySQL, PostgreSQL, and some very distributed ones as well, like CockroachDB, and so on and so forth. All of these are meant to store, in some way, shape, or form, either tabular or other structured or semi-structured data: objects, JSON documents if you're using MongoDB, or key-value pairs, since Redis is a key-value store.

But so much of the data out there today is unstructured. By unstructured, we're looking at images, for example, of a variety of different sizes and resolutions. Audio could be different snippets spoken by different people, or music. Images, audio, video, text: all of these are what we consider unstructured data. And there's so much of it out there today. A good statistic I like to quote is that something like 720,000 hours of YouTube videos are uploaded per day, and we wanna have a way to search across all of these videos from a semantic perspective. In the context of YouTube or any other video sharing platform, we wanna be able to search not just by the title or the description of the video, but also by the contents of the videos themselves. We also wanna be able to use these pieces of unstructured data. That's really where the idea behind the vector database was born.

And I wanna say within the first year or two, we ended up communicating in some way, shape, or form with over a thousand enterprise users of this open source project, and we saw tremendous growth internally at Zilliz. And, yeah, it's been a ride ever since.

Digging a bit more into this idea of the vector database, there's another conversation I had a little while ago with the folks from the Pinecone project. I know it's not the same space, but there's what could be viewed, at surface value, as a similar concept in the bitmap database that the folks at the Pilosa project built, where it's actually a database that just stores the indices on the underlying data so that you can do faster querying across these different datasets. I'm wondering if you can talk to some of the types of use cases that a vector database enables and how it might compare to or contrast with this idea of a bitmap database, given that these are sort of unique types of database engines compared to the systems you mentioned that people are familiar with, like MySQL, Postgres, Cockroach, and document databases.

I want to say that for the use cases for vector databases, most people today would associate them with having an upstream machine learning model and wanting to store, index, and search the embedding vectors from those machine learning models. But there are plenty of other examples out there of folks turning pieces of unstructured data into vectors or embeddings using what I like to call handcrafted algorithms. So I'll go over examples from both sides: ones that use these machine learning models as well as ones that use more handcrafted features.

I'm gonna go back to the example of video. Let's say I am a short video platform, and I wanna be able to understand user behavior. I wanna be able to recommend new videos. If you look at TikTok and some of the other pretty popular short video platforms out there, I think a lot of folks would say that they do a pretty good job of doing recommendations. There's quite a bit of semantic information from inside the video that we can look at as well. A common use case that we see in the Milvus community is taking the video, indexing either the entire video or individual frames of the video, and being able to recommend new videos based off of that. Or, let's say, being able to serve ads based on the contents of the videos that I like to watch. That's one of many use cases when it comes to video.

Another one that I enjoy talking about with other folks is security. I'm not sure how common a use case it is, but Trend Micro at some point came to us and said, you guys are doing great work, and we wanna be able to do threat analysis over Android APKs. We have an existing pool of APKs that we know are malware. We've taken them and analyzed them, and if a new APK comes online, we wanna be able to index that and see if it's malware as well. With the number of apps that are out there today and just growing, they had quite a large system in production and quite a large number of APKs and other applications that they were scanning. I think this is a great example of an application that doesn't necessarily use machine learning upstream.

What they did was take the contents of the APK itself. I don't know all the details, but they looked at, let's say, the number of reads and writes to disk or how often it accesses the network. They took all of these features and put them together into an embedding, very much in the traditional machine learning sense, and they used that to great effect within Milvus.

So these are just 2 of the many examples, and I highlighted these 2 because in the video case, if you wanna do recommendation or serve ads, there are machine learning models that sit upstream. But when it comes to, let's say, threat detection, something that we did with Trend Micro, there are a lot of other features that you can use as part of the embedding as well. And I'm also just gonna list off some of the other use cases that we see these days with vector databases.
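As a rough illustration of the handcrafted approach described for the APK case, the sketch below turns a few behavioural counts into a vector and compares a new app against known-bad profiles. Everything specific here is an assumption made for illustration: the feature names (`disk_reads`, `disk_writes`, `network_calls`), the log scaling, the sample numbers, and the helper names. The conversation does not describe the actual feature set or pipeline that was used.

```python
import math

# Hypothetical behavioural features; the conversation only mentions disk
# reads/writes and network access as examples.
FEATURES = ("disk_reads", "disk_writes", "network_calls")


def to_embedding(stats):
    """Log-scale the raw counts and normalise the result to unit length."""
    raw = [math.log1p(stats.get(name, 0)) for name in FEATURES]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


def l2(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up profiles standing in for a pool of known-malicious apps.
known_malware = {
    "malware_a": to_embedding(
        {"disk_reads": 10, "disk_writes": 900, "network_calls": 5000}
    ),
    "malware_b": to_embedding(
        {"disk_reads": 4000, "disk_writes": 20, "network_calls": 3}
    ),
}


def nearest_known(stats):
    """Return (name, distance) of the closest known-malware profile."""
    query = to_embedding(stats)
    return min(
        ((name, l2(query, vec)) for name, vec in known_malware.items()),
        key=lambda pair: pair[1],
    )


# A new, unseen app whose behaviour resembles the first profile.
name, dist = nearest_known(
    {"disk_reads": 12, "disk_writes": 850, "network_calls": 4700}
)
print(name, round(dist, 4))
```

In a real deployment the linear scan in `nearest_known` would be replaced by a query against the vector database, and a distance threshold would decide whether a match is close enough to flag.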

We see intelligent chatbots. That's actually quite a common one for a lot of our user base. There is reverse image search. The Cleveland Museum of Art did one of my favorite applications, where they got a reverse image search application up and running in, I wanna say, just under a week with a very, very tiny team. I thought that was very impressive. There's semantic text search, if I wanna be able to search over text not just by keywords but also by the contents of the text or the phrases themselves. And there are many, many more out there as well. As I mentioned, I wouldn't say onboarded, but we've communicated in some way, shape, or form with over a thousand enterprise users.

On the second part of your question earlier, talking about these other types of databases: there are bitmap databases, and there are the traditional relational databases we were mentioning, like Cockroach and Postgres. What I will say is that a common misconception that comes up when I talk to a lot of folks about vector databases and about our architecture, even in the Milvus community and internally as well, is that Milvus one day is going to replace MySQL, or PostgreSQL, or MongoDB, or whatever. That's absolutely not the case. We don't see Milvus as replacing all of these traditional relational databases, object databases, wide column stores, and so on and so forth. My vision for vector databases is that they're used in conjunction with these traditional databases, Mongo, Redis, etcetera, to open up new revenue streams if you're a company or to help you understand the massive quantities of unstructured data that you have.

Milvus is not meant to be the next evolution of the single database that you would ever wanna use. That's not the case. I think down the road, let's say in the next 3 to 5 years, you will see more and more applications that use both of these types of databases, databases for unstructured data and databases for structured data, in conjunction and to great effect.

In terms of the similarity search capabilities, one of the things you mentioned is being able to generate these embeddings of video files and then determine whether or not there are similarities between different videos, or the reverse image search use case. I can imagine it being used in a copyright enforcement use case where you can say, I want to detect whether or not some portion of this song is used in this video. I'm wondering if you can talk to some of the applications of being able to do kind of subset matching, where you say, I have these 2 distinct vector embeddings, and I just want to know whether or not there is some common portion of these vectors.

Deduplication, both audio and video deduplication, is a big part of our use cases as well. You were asking specifically about the case where, let's say, I have a subset of the vector that I wanna look at. Let's say I have a 1,024-dimensional vector, and I only wanna look at the first 100 elements. This ties in with the idea that a vector database is very, very different from a traditional relational database, an object database, or any other database for semi-structured data.

When it comes to vectors, you really are looking mostly at 2 things, I would say. The first is the distance between different vectors that correspond to different pieces of unstructured data. Let's say I have 2 images, and there are 2 vectors there, so you would look mainly at the distance between those 2. The other thing that you might wanna look at is vector arithmetic, as I like to call it. I'll go into both of these in a little bit further detail.

First, when you look at the distance between 2 vectors, you're looking at them as a whole. I was at SIGMOD, which is an academic database conference, I wanna say about 2 months ago, and I was talking to a lot of folks. A lot of the questions that I got were, why can't I store these vectors in a tabular database, in a relational database, and have each column be an element of the vector? So let's say I have a million images, and these million images correspond to a million vectors that are generated from these machine learning models. I think there's a lot of misunderstanding there about what the vectors are there to represent.

I can absolutely do that. I can absolutely store vectors in a relational or table-based format, but each column of the vector is not very meaningful on its own. What's more meaningful is the distance between them. For folks who are a little bit more ML or data science savvy: if I have these high dimensional vectors or tensors that come from these machine learning models, and let's say I have a million of them, and I multiply all million by the same n-dimensional rotation matrix, the resulting embedding space is effectively the same as it was before, because every distance between vectors is preserved. It is an entirely valid embedding space. And in the context of a machine learning model, I would argue that a model, depending on the input data and the model architecture, could learn that embedding space just as easily as the original one.

So, a bit off topic, but getting back to the original point: what you really wanna look at for embeddings primarily is the distance between them. And when it comes to distance, whether we look at L2 or cosine distance, the actual individual elements of the vectors themselves don't really matter too much.
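The rotation argument can be checked numerically. The following is a minimal sketch, not anything taken from Milvus: it builds a rotation out of a few plane (Givens) rotations, applies it to two random 8-dimensional vectors, and confirms that L2 and cosine distance are unchanged even though the individual elements are not. The dimensions, angles, and helper names are arbitrary choices for illustration.

```python
import math
import random


def l2(a, b):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rotate(vec, i, j, theta):
    """Rotate vec by theta radians in the plane spanned by axes i and j."""
    out = list(vec)
    c, s = math.cos(theta), math.sin(theta)
    out[i] = c * vec[i] - s * vec[j]
    out[j] = s * vec[i] + c * vec[j]
    return out


# An arbitrary sequence of plane rotations; composing rotations gives
# another rotation of the whole 8-dimensional space.
PLANES = [(0, 1, 0.7), (2, 5, -1.3), (3, 7, 2.1), (1, 6, 0.4)]


def rotate_all(vec):
    """Apply the same fixed rotation to any vector."""
    for i, j, theta in PLANES:
        vec = rotate(vec, i, j, theta)
    return vec


random.seed(0)
a = [random.gauss(0.0, 1.0) for _ in range(8)]
b = [random.gauss(0.0, 1.0) for _ in range(8)]
ra, rb = rotate_all(a), rotate_all(b)

print("L2 before/after:    ", l2(a, b), l2(ra, rb))
print("cosine before/after:", cosine_distance(a, b), cosine_distance(ra, rb))
```

The two printed pairs agree up to floating point rounding, which is why a column-per-element table layout carries little meaning on its own: the element values depend on an arbitrary choice of axes, while the distances do not.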

The second thing that I was talking about earlier, vector arithmetic, works kinda like this, and I'm trying to explain it as best I can without a whiteboard. Say I have an image of a sailboat, and let's say the sailboat is white. I take that image, I run it through an ML model, and I get a vector out of it. Then I have, let's say, text that, using some multimodal model, can be embedded into the same space. Let's say I have the word blue. If I take the embedding that corresponds to the image of the sailboat and the embedding that corresponds to the word blue, add those 2 vectors together, and search for the nearest neighbor of the result, say across a billion images that I've indexed in Milvus, we would most likely find an image of a blue sailboat. Or if we were searching over text, we might find the text blue sailboat.

These types of vector arithmetic applications, I think, are very, very exciting as well. Early on we were definitely more focused on the first half that I mentioned, so doing just nearest neighbor search. But vector arithmetic is really quite an untapped area, and I think a big area that Milvus will be looking to in the future.
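To make the sailboat example concrete, here is a toy sketch of vector arithmetic followed by a nearest neighbor lookup. The 3-dimensional vectors are hand-made for illustration, with axes loosely standing for sailboat, blue, and white. They are not the output of any real multimodal model, and the dictionary stands in for an indexed collection.

```python
import math


def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy embeddings; axes are roughly (sailboat, blue, white).
image_index = {
    "white sailboat": [1.0, 0.0, 0.3],
    "blue sailboat": [1.0, 1.0, 0.0],
    "blue sky": [0.0, 1.0, 0.0],
    "white cloud": [0.0, 0.0, 1.0],
}

# Toy embedding of the word "blue", assumed to live in the same space.
text_embedding_blue = [0.0, 1.0, 0.0]

# Image of a white sailboat + the word "blue", then nearest neighbor search.
query = add(image_index["white sailboat"], text_embedding_blue)
best = max(
    image_index,
    key=lambda name: cosine_similarity(query, image_index[name]),
)
print(best)
```

With real learned embeddings the result is only likely, not guaranteed, to be the combined concept, which matches the hedged way the example is described in the conversation.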

There are quite a few different applications when it comes to deduplication, and there are many different ways to do it. This ties in with what I was talking about in terms of what these embeddings and these vectors mean. Sometimes when they wanna do deduplication, some of our users will actually take entire videos and try to do the matching there. In that case, if you have a very long video and a very, very short segment that you wanna dedupe out of it, it might not work as well. Then there's another approach where, let's say, we have an hour long audio snippet, and we'll divide it into very small chunks, maybe 10 second chunks, and index all of those chunks inside of Milvus. That type of deduplication is much more robust, because we're still doing full vector, full embedding deduplication, just on each chunk.

So when we look at an embedding, yes, we are taking this piece of unstructured data and turning it into this high dimensional vector. But at the same time, we generally don't look at subsets of the vector. We don't look at individual columns or individual elements of that vector. It's more taking the input piece of unstructured data as a whole and turning that into an embedding.
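The chunking approach can be sketched in a few lines. This is a simplified illustration under several assumptions: the signal is synthetic, the `embed` function is a placeholder (slice averages) rather than a learned audio model, chunks are non-overlapping, and the query clip happens to be aligned to a chunk boundary. Real systems typically need overlapping windows and a proper embedding model.

```python
import math

CHUNK = 10  # samples per chunk, standing in for "10 second" chunks


def embed(chunk, dims=5):
    """Placeholder embedding: the mean of each of `dims` equal slices."""
    step = len(chunk) // dims
    return [sum(chunk[i * step:(i + 1) * step]) / step for i in range(dims)]


def l2(a, b):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def split(signal, size=CHUNK):
    """Cut a signal into consecutive, non-overlapping (offset, chunk) pairs."""
    return [
        (start, signal[start:start + size])
        for start in range(0, len(signal) - size + 1, size)
    ]


# Deterministic synthetic "audio" so that every chunk looks different.
long_track = [math.sin(0.37 * t) + math.sin(0.011 * t * t) for t in range(600)]

# Index one embedding per chunk, keyed by the chunk's start offset.
chunk_index = [(start, embed(chunk)) for start, chunk in split(long_track)]

# A short clip lifted from the middle of the track.
clip = long_track[230:240]
query = embed(clip)

best_offset, best_dist = min(
    ((start, l2(query, vec)) for start, vec in chunk_index),
    key=lambda pair: pair[1],
)

# For contrast: comparing the clip against one embedding of the whole track.
whole_track_dist = l2(query, embed(long_track))

print(best_offset, best_dist, whole_track_dist)
```

The chunk-level search recovers where the clip came from, while the single whole-track embedding has averaged that detail away, which is the weakness described for matching a very short segment against a very long recording.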

One of the other things that I'm very curious about in this vector embedding space is the potential for implicit collisions in the ways that the vectors are represented, where you have a machine learning model or some custom algorithm, like you were mentioning in the case of the Android APKs, that will generate a particular vector representation of the information that you're trying to work with. But if you have 2 different machine learning models, where one of them is being trained on audio data and the other one is being trained on images or lidar or biological sequences, what are the potential risks of accidentally ending up in a similar vector space with these completely disparate data types, where if you try to do a similarity search on, you know, a genetic sequence, you end up finding the audio for a hip hop song by accident?

That's a great question. You're sort of touching on the idea of multimodal learning, the idea that we have different modes of unstructured data. I mentioned images, video, audio, and text a little bit earlier in this podcast, but we also have other pieces of unstructured data as well. You were talking about, let's say, a genetic sequence. We have protein structures, graphs, maps, and pop songs as well. These are all pieces of unstructured data. But once you really get down into the nitty gritty of what data a model was trained on, a lot of that is really up to the application developer, to make sure that they don't fall into the pitfall of accidentally taking a protein structure and searching across a song database.

One thing that I do wanna touch on, and I think I have mentioned this before as well, is the idea that we are going to see way more multimodal learning methodologies in the future. A common one that you see already is images and text. There are really powerful models out there that are trained over tons of data, probably billions of examples, like CLIP from OpenAI. And when you have these embeddings that are in the same space, you can do the vector arithmetic and other cool things that I was talking about a little bit earlier.

The reason why I envision a vector database being such a huge part of pretty much all applications in the future, or at least all applications that have unstructured data, is that I think in ML we're moving in a direction where we want to embed everything. And not just embed everything, but embed everything into the same space. So it's foreseeable that one day I could embed text and protein structures into the same space and search over protein structures just by describing the kind of structure that I'm looking for. I'm giving it a textual query, turning that into an embedding, and then searching for relevant protein structures.

It's also foreseeable that, given a 3D molecular structure, I could find a pop song, maybe from a biomedical PhD who's singing about this protein structure or this molecular structure in one way, shape, or form. I think that's definitely foreseeable.

I think embeddings are gonna be a very core part of a lot of these future applications, and the reason is that these embeddings are becoming so powerful from a semantic perspective. They're becoming very much a Swiss Army knife for different pieces of unstructured data and for correlating different pieces of unstructured data together. So, great question there.

It's time to make sense of today's data tooling ecosystem. Go to dataengineeringpodcast.com/rudder to get a guide that will help you build a practical data stack for every phase of your company's journey to data maturity. The guide includes architectures and tactical advice to help you progress through 4 stages: starter, growth, machine learning, and real time. Go to dataengineeringpodcast.com/rudder today to drop the modern data stack and use a practical data engineering framework.

And now digging into Milvus itself, can you talk to the

architecture and implementation

of the database engine and some of the core system requirements that have factored

Azure cloud platform as well, which is currently in early access.

For Milvus, this sort of ties in with the history of the project. Milvus 1 was very much a single instance. If you wanted to scale up, you needed to increase the number of cores that you had, or add more GPUs, or add more RAM. It was very much a single-instance solution.

Milvus 2.0 is entirely cloud native. It's production ready. It has a lot of the database features that you would wanna see in a distributed database these days, let's say, like Cockroach or Snowflake. It has failover, replication, the full-fledged features that you wanna see to put any type of database into production. Moving from Milvus 1 to Milvus 2 was actually a huge challenge for us, because the only piece of the architecture that remained constant was the compute engine, which we call Knowhere. That's a Marvel movie reference.

It was a big challenge for us moving from this single-instance thinking, where maybe we can take the data, store it on that instance, and do queries, indexing, searches, and storage all on that single machine, to a massive cloud native, cloud focused solution.

Diving a little bit more into the architecture, and I'm gonna focus on Milvus 2.x here: we try to separate out storage from compute and indexing from compute. We have multiple different types of workers. We have data nodes, index nodes, and query nodes. These are all horizontally scalable independently of each other. A very unique thing that we've done, which I have yet to see any other vector database do, either open source or closed source, is that we've separated a lot of these planes. We've separated compute from storage. We've separated the coordinators from the workers. And it's very much a streaming-based service as well, so we use Kafka or Pulsar as a part of the overall Milvus architecture.

What this has allowed us to do is the following. Let's say a user wants to use a vector database. They come to Zilliz Cloud, and they're like, hey, I have this application, I need a vector database, and I wanna be able to do something very quickly. Instead of provisioning a machine for them, what we can now do is provision the solution for them, and they have the option to purchase compute units. These units serve as a sort of currency for them to be able to do all sorts of different types of computation within Milvus.

If they have, let's say, data that they just wanna store, and they don't wanna do any compute over it for the time being, maybe because it's still in a development phase or an inactive state, they don't have to use any of those CUs, any of those compute units. You see a lot of these data lakehouse models, from Snowflake and Databricks, that have done really well in this respect, and we have tried to replicate that for vector data. That saves a lot of effort in terms of application development for the user, but it also saves money as well.

For us, as a part of the Milvus community and on the Zilliz team as well, the switch from Milvus 1 to Milvus 2 has absolutely been very, very difficult. I wanna say we've spent over a year and a half, maybe 2 years, just doing this switch alone, so that we can have an architecture that is, yes, horizontally scalable and very flexible in that respect, but also one that users will be willing to put into production. It has all of these features, like failover and replication across many different colos.

As far as the scalability aspects, I'm curious what the storage layer looks like and what challenges you've run into in figuring out how to shard or partition the different vectors in a particular collection so that you can optimize for speed of access and analysis. The goal is to provide this very quick interaction of being able to say, here's my input vector, give me the nearest neighbor, so that I can determine what the actual result is that I'm trying to find. For example, doing the reverse image search, find me the closest image, and do that in a time span that is interactive and conducive to an end user application.

Great question as well. Absolutely keep these coming. This ties into the whole idea of a vector index. In any database, you wanna have indexes to help you do these queries very quickly, maybe indexes based off of particular features, and so on and so forth. And to do these searches quickly, we want to be able to build these indexes in a very efficient manner. Not just for Milvus, but for any of the variety of vector search or vector database solutions that you see out there, there's always an indexing component. There's always an index building component, and there are many different types of index that you can choose from.

The most common one, I believe, is probably HNSW, which is a graph-based index. But Milvus, I think, is fairly unique in that our core engine allows you to select from one of many different types of indexes. HNSW, of course, is one of them, but we also provide another index, I think originally from Erik Bernhardsson, from when he was at Spotify, called Annoy: Approximate Nearest Neighbors Oh Yeah. We provide indexes based off of product quantization, and just flat indexes as well. So if you have, let's say, a very small database and you don't wanna build an index specifically for it, and you'd rather search through all of the vectors every time you do a query, that's absolutely okay as well.
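Of the index types mentioned, the flat option is the simplest to show, because it is just an exhaustive scan. The sketch below is a plain Python illustration of that idea and not Milvus code: `flat_search` is a hypothetical helper, and the 1,000 random 16-dimensional vectors are made-up data. Approximate indexes such as HNSW, Annoy, or product quantization exist to avoid this full scan on large collections, usually at some cost in accuracy.

```python
import heapq
import math
import random


def l2(a, b):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(vectors, query, k=3):
    """Exhaustive search: compare the query with every stored vector.

    Returns the k closest entries as (distance, position) pairs,
    ordered from nearest to farthest.
    """
    return heapq.nsmallest(
        k, ((l2(vec, query), pos) for pos, vec in enumerate(vectors))
    )


random.seed(42)
vectors = [[random.random() for _ in range(16)] for _ in range(1000)]

# Query with a copy of a stored vector so the true nearest neighbor is known.
query = list(vectors[123])
results = flat_search(vectors, query, k=3)
print(results)
```

Every query costs one distance computation per stored vector, which is fine for a small collection and exact by construction, and it is the baseline that approximate indexes are measured against.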

So these indexes I'm not gonna go into the details of how all of these indexes work, but these indexes play a key role in allowing us to search across

massive quantities of unstructured data, massive amounts of embeddings

at a very, very fast rate. So we always try to keep our query latencies,

you you know, within 5 to 10 milliseconds.

And I think that's sort of the first answer to this. And I think the second part to this question is

we have multiple functions within the vector database. There's indexing,

querying

data, so on and so forth. There's a coordinator. Each of these functions can actually scale

as you need them to.

So, for example, if I have an application that's actually writing or doing updates, inserts deletes a lot to my vector database, but probably not that many queries, and I'm using it more as a storage engine,

I can perhaps scale out the data nodes. I can scale out the indexing nodes as

I see fit.

Right? On the other hand, if I have an application where the data is mostly static

or entirely static, I've seen a lot of those use cases as well where the user will simply upload a lot of vector data to our vector database,

leave it there for, let's say,

you know, however long. They'll be doing a lot of queries across it. We can scale out our

query cluster in accordance

with that particular use case as well to be able to utilize the indexes that are already built

and to be able to

boot up the right types of machines that give us the most

efficiency and performance when we want to do those queries there. So I hope that more or less answers your question, Tobias, where there's the component of the index, but there's also how we use that index internally within the Milvus architecture.

Definitely would be a little bit easier if I had a whiteboard here talking about all this, but, yeah, I hope that's mostly clear there. It's definitely

a challenging aspect of running a podcast on these types of topics is when you start to get into the nitty gritty, you have to try and figure out how to build the appropriate word pictures without just completely turning everybody around. I'm trying to correlate. I have this, it's like a diagram sort of in my mind, and I'm trying to figure out which pieces of that diagram are relevant for the topic at hand. But please go ahead. As you were talking about

the architecture

allowing for tuning these different aspects of how the database

is being used and being able to support various workloads.

We might have a heavy write capacity and a light read capacity or vice versa.

That also brings up the question of

the CAP theorem and where it sits in the CP versus AP

continuum and some of the ways that you think about that, particularly as you're tuning it in these different directions for high throughput or expanded storage capacity, high write, high read volumes, things like that.

I am actually really, really glad that you asked that question because here at Zilliz, we have what we like to call the new CAP theorem. So

for us

for us, the c is sort of cost effectiveness or cost efficiency. Right? And I'll talk about the c, the a, and the p a little bit later.

But the a is accuracy, and the p is performance. This is what we call the new CAP theorem for vector databases.

You know, there's probably gonna be a blog post coming on that very soon, but I'll go over it very quickly here. You know, the idea is let's say I am a very, very large company. I have a lot of data. I wanna be able to search for the data very, very quickly, but I also wanna do it very, very accurately.

By accurately, what we mean is I wanna search for the actual

nearest. If I search for the top 10 nearest neighbors, you know, I wanna be able to get the actual top 10, let's say, 99% of the time or 99.9%

of the time. This sort of fits in with, I didn't go too much into detail of this, but this sort of fits in with what we were just talking about with vector indexes,

where you have this index, but it is

inherently

probabilistic. Right? It's not deterministic like you would see for, let's say, a relational database or a database for semi structured data, an object database.

The idea being that when I do this vector search, it is approximate nearest neighbor search. It is ANN.

The index allows me to search for the approximate nearest neighbors to a particular query vector or query embedding.

There are some different parameters that I can tune to give me a much higher probability

of getting the actual nearest neighbors. If I search for the top 10, perhaps I only get the top 10, let's say, 90% of the time. There are some knobs that I can tune to give me maybe 99%

or 99.9%.
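The accuracy being described here is usually measured as recall at k: the fraction of the true top k that the approximate search actually returned. A minimal sketch, assuming you have both the approximate result IDs and the exact (brute force) result IDs for the same query; the function name and the ID lists are made up for illustration.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate search returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    hits = len(set(approx_ids) & set(exact_ids))
    return hits / len(exact_ids)

# Hypothetical results for one query: 8 of the true top 10 were found.
exact_top10 = [3, 8, 15, 21, 34, 42, 57, 63, 70, 99]
approx_top10 = [3, 8, 15, 21, 34, 42, 57, 63, 11, 12]

recall = recall_at_k(approx_top10, exact_top10)
print(recall)  # 0.8
```

Averaging this value over many queries gives the kind of "90% versus 99%" figure mentioned above.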

So that's where the accuracy perspective of the CAP theorem comes in. The CAP theorem for vector databases,

there's accuracy. I'm a very large company, and I wanna be able to very accurately

and very quickly

search for my top nearest neighbors. Let's say my top 10 or top 100.

In that case, I could throw a lot of machines at it. I could throw a lot of compute resources at it, say, like, hundreds of machines.

And I could really get my results very quickly with very, very low latency, very high throughput.

And at the same time, I would get very, very accurate results. So I would pretty much always get the top 10 or the top 100 or the exact top n that I'm looking for. But, you know, you're throwing so many compute resources at it that it's not very cost effective.

Right? It's gonna have very high cost in terms of CPU, GPU.

If you're running accelerators like an FPGA accelerator, it'll have high cost in terms of accelerators as well.

In line with the CAP theorem for more traditional databases,

it's really a

choose 2 out of 3. Right?

Now if I have an application where, let's say, I am less concerned about accuracy,

I don't need as heavy of an index.

I can actually do the querying much faster, but then I lose a lot of the top end the potential top end accuracy. So

let's say I can do a query much quicker with fewer resources,

but perhaps I only get perhaps my in the top 10 results that I get,

maybe only

7 or 8 of them are in the real, quote, or the actual top 10 results.

So in that case, I've traded off accuracy

for cost effectiveness.
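To make the accuracy versus cost trade-off concrete, here is a toy sketch of a bucketed (inverted file style) index in plain Python. It is an assumption-laden illustration, not how Milvus implements its indexes: the centroids are picked at random rather than trained, and the knob name `nprobe` is borrowed from common ANN terminology. Scanning fewer buckets compares fewer vectors (cheaper) but may miss some of the true nearest neighbors.

```python
import math
import random

def build_buckets(vectors, centroids):
    """Assign each vector ID to the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))
        buckets[nearest].append(vec_id)
    return buckets

def search(vectors, centroids, buckets, query, k, nprobe):
    """Scan only the `nprobe` buckets whose centroids are closest to the query.

    Returns (top-k IDs, number of vectors actually compared).
    """
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [vec_id for c in order[:nprobe] for vec_id in buckets[c]]
    candidates.sort(key=lambda vec_id: math.dist(vectors[vec_id], query))
    return candidates[:k], len(candidates)

rng = random.Random(0)
vectors = [[rng.random(), rng.random()] for _ in range(1000)]
# Hypothetical shortcut: use 16 randomly chosen vectors as centroids.
centroids = [vectors[i] for i in rng.sample(range(1000), 16)]
buckets = build_buckets(vectors, centroids)
query = [0.5, 0.5]

# Probing every bucket is equivalent to an exact, flat search.
exact_ids, exact_scanned = search(vectors, centroids, buckets, query, k=10, nprobe=16)
# Probing a single bucket is much cheaper but only approximate.
cheap_ids, cheap_scanned = search(vectors, centroids, buckets, query, k=10, nprobe=1)

recall = len(set(cheap_ids) & set(exact_ids)) / len(exact_ids)
print(f"nprobe=16 scanned {exact_scanned}; nprobe=1 scanned {cheap_scanned}, recall {recall:.2f}")
```

Raising `nprobe` moves the result toward the exact answer while increasing the number of comparisons, which is the accuracy for cost trade described above.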

Right? And there's also other examples

where you may want the c and the a rather than the p. Maybe I can withstand

these 1 second or 2 second latencies,

very, very low throughput.

But I wanna use very few machines, and at the same time, I want very, very accurate results. We've seen a lot of those applications come in when it comes to 3 d molecular search as well. So,

yeah, yeah, this is sort of like

a it's, you know, it's sort of like a sneak peek of what we like to call the new CAP theorem for vector databases. Sort of went off on a tangent there, but I can never resist

the opportunity to talk about this with other folks. Absolutely.

Hope that sort of makes sense. No. It definitely does.

And so

from a practical perspective, for people who are interested in using Milvus,

they are either using machine learning models to generate these embeddings or they have their own specific algorithms for being able to build these vector representations.

What is involved in the process of actually

getting it set up and integrating it into an application and building on top of it, and in particular, thinking about some of the data modeling aspects of how do

I assign appropriate metadata, what are some of the organizational capabilities,

being able to understand

what the,

sort of concurrency capabilities are, things like that.

Milvus, I think, today is

easier than ever to get set up with.

Milvus inherently, you know, has a distributed architecture.

So, you know, it uses Kubernetes as sort of like the execution engine there. We have Helm charts that you can use, you know, just straight from the milvus.io website to deploy a Milvus cluster on prem.

And, really, the whole idea there is that we try our best to abstract away as much of the vector database maintenance as possible.

Now if you do choose to do on prem, you obviously have to maintain the compute resources on your own end. We also, as I mentioned earlier, have a cloud version that currently, you know, as of July 2022 is in early access. It's only a closed beta type thing. And,

you know, we will open it up to other users, I wanna say, in about a month or 2 as well. So definitely keep on the lookout for that. But when it comes to actually getting up and running with a Milvus deployment,

I think it is absolutely easier than ever, especially as it comes to Milvus 2.x, our cloud native version of Milvus,

where

you can, you know, if you want to play around with it just locally on your laptop computer, maybe you're a data scientist, you want a very small number of embeddings, let's say, only 5,000,000 or 10,000,000 embeddings,

you can absolutely do so. You can do it with a simple, if you're running Linux, you can use apt or yum, do a standalone version

of Milvus.

And then what I think is very, very unique is that we can later on, if you do decide to put that data into production, you can sort of take that because we use

MinIO as the underlying object storage across all of our platforms.

You can take that and essentially put it into production

very, very easily. Right? You can put it into a cloud native. You can put it into cluster mode very, very easily.

And, you know, we're working towards being able to do the same for our Zilliz cloud platform as well,

where if you have an internal on prem instance of Milvus, you know, across your cluster,

you can

do the migration to cloud as well. So you have all these vectors that you're storing locally in an internal cluster.

We give the option to do that. We will have the option of having all that data in cloud in a managed service as well. So

there's different

levels, I think, that organizations

or companies,

research groups are comfortable with, and for each of those levels of comfort, I guess, so to speak, we try to have a solution. We try to have something for all of those users.

When it comes to actually building an application

on top of Milvus,

you have the ability to do these

nearest neighbor searches on these vectors.

And for the case of, for instance, the reverse image search, I'm wondering if you can talk to some of the ways that you go from,

I have this input vector.

I determine this is the nearest neighbor to that, and then mapping that nearest neighbor vector back to the actual source object to be able to return the image that you are looking for. And some of the ways that you think about the application architectures where you're using Milvus for these fast searches, but then mapping that into the broader application

architecture and the ecosystem around it. You know, specifically, when it comes to the case of reverse image search, we actually leave

the

non vector data management

more or less up to the user. And I think that

we do that for multiple reasons. Right? The first is to give in the application layer sort of like a level of flexibility, some certain degree of flexibility for the application developer or for the engineer who's working with the vector database. The second reason is because it also simplifies things for us. Right? So, typically,

if I were building,

let's say, a reverse search system, I think it's 1 of the very common ones

1 of the very common applications for a vector database.

What you want to do is

we work primarily just with the vectors

themselves. So if you have a vector that corresponds to an image, you can index that into you can insert that into Milvus. Milvus will do all of the indexing, the flushing,

the sort of vector data management in the background.

And Milvus will actually return

sort of, like, an object ID. So it's very analogous to the underscore ID field in MongoDB, right, or essentially something that corresponds,

something that you can use to

map the embedding back to your original object.

When we return that object ID or not object ID, but when we return that ID more generally,

it is really up to the user and the application level to figure out what that ID which piece of unstructured data that ID maps to. And you see this already ties in with what we were talking about earlier when we were chatting about how a vector database is not really meant to replace these relational databases or replace these key value stores.

There's always the use case. Let's say I have

I have an s 3 path that corresponds to an image or I have a file on my local disk that corresponds to an image that I wanna index in my reverse image search solution. I still need some other external way of mapping,

let's say, that URL or that path on my local drive to that particular object ID.
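The division of labor described here can be sketched in a few lines of plain Python. Everything in this block is hypothetical: `ToyVectorStore` is an in-memory stand-in for a vector database that only knows IDs and vectors, and the S3 paths and embeddings are invented. The point is only that the application keeps its own ID to path mapping and resolves search results through it.

```python
import math

class ToyVectorStore:
    """Hypothetical stand-in for a vector database: stores vectors, returns IDs."""

    def __init__(self):
        self._vectors = {}
        self._next_id = 0

    def insert(self, vector):
        """Store a vector and return the ID assigned to it."""
        vec_id = self._next_id
        self._vectors[vec_id] = vector
        self._next_id += 1
        return vec_id

    def search(self, query, k):
        """Return the IDs of the k vectors nearest to `query`."""
        ranked = sorted(self._vectors, key=lambda vec_id: math.dist(self._vectors[vec_id], query))
        return ranked[:k]

store = ToyVectorStore()
id_to_path = {}  # owned by the application, e.g. in a relational or key-value store

# Made-up image locations and embeddings.
images = {
    "s3://example-bucket/cat.jpg": [0.9, 0.1],
    "s3://example-bucket/dog.jpg": [0.1, 0.9],
    "s3://example-bucket/kitten.jpg": [0.8, 0.2],
}
for path, embedding in images.items():
    id_to_path[store.insert(embedding)] = path

# The vector store answers with IDs; the application maps them back to objects.
matches = [id_to_path[vec_id] for vec_id in store.search([1.0, 0.0], k=2)]
print(matches)
```

In a real deployment the `id_to_path` dictionary would live in whatever database the application already uses for its structured data.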

But all of the vector data management,

all the unstructured data management, so to speak, that is left entirely, you can leave that entirely to Milvus,

we will

figure out a way to store that vector. We'll figure out when to index that vector into

1 of the ANN algorithms I was talking about earlier. And

when you have a new vector that you wanna use to query across the existing database, we'll figure out how that query can be done most efficiently.

So that's just specifically

in the example, you know, if you want to build a reverse search system. But

I think that also ties in with many of the other systems that many of the other applications that we were talking about earlier as well. So if you were to do video deduplication,

you would have a

you would have these video snippets. You would index them, and you would do nearest neighbor search. If that nearest neighbor if the distance between a particular input

and its nearest neighbor was close enough, you would say, okay, this is a duplicate of the input. And again, it goes through a very similar flow there, where you have the data. You know, you get the embeddings, and you want to figure out which of the ones in my database are sort of related.
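The deduplication flow above boils down to a nearest neighbor lookup followed by a distance threshold. A minimal sketch in plain Python, where the clip names, embeddings, and the threshold value are all made up for illustration; a real threshold would have to be tuned for the embedding model in use.

```python
import math

def find_duplicate(index, candidate, threshold):
    """Return the ID of the nearest indexed snippet if it is within `threshold`, else None."""
    if not index:
        return None
    nearest_id = min(index, key=lambda vec_id: math.dist(index[vec_id], candidate))
    if math.dist(index[nearest_id], candidate) <= threshold:
        return nearest_id
    return None

# Hypothetical embeddings of video snippets that are already indexed.
index = {
    "clip-a": [0.10, 0.90, 0.30],
    "clip-b": [0.80, 0.20, 0.50],
}
THRESHOLD = 0.05  # arbitrary value chosen for this example

near_copy = [0.11, 0.89, 0.31]  # almost identical to clip-a
new_clip = [0.40, 0.40, 0.40]   # not close to anything indexed

print(find_duplicate(index, near_copy, THRESHOLD))  # clip-a
print(find_duplicate(index, new_clip, THRESHOLD))   # None
```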

Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95%

reported being at or overcapacity.

With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation.

In fact, while only 3.5%

report having current investments in automation,

85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend dot io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion,

transformation, orchestration, and observability.

Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug in architecture,

as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP.

Go to dataengineeringpodcast.com/ascend

and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer.

In terms of your work on building Milvus and using it at Zilliz and working with your customers and the community members who are building things on top of it, what are some of the most interesting or innovative or unexpected ways that you've seen Milvus used? I'll go over 3, and I think I've touched upon this a little bit earlier as well. I'll go over 3 of my favorite applications.

The first is

AI drug discovery, as we like to call it, where you have a you have a pharmaceutical application or you have

a variety of these different proteins, a variety of these different molecular structures.

And, again, I was not a biology major in college, so probably not the best person to ask about some of the science or some of the biology behind this. But you take these molecular structures and, let's say, you have a particular drug that does something, that accomplishes something, perhaps tackles a particular symptom,

or it has some effects in the body that they also wanna be able to replicate using, let's say, you know, some other type of molecular structure. Right? That's 1 of the interesting applications that I've seen with Milvus there, where you can actually

discover

new drugs based on molecular structure of existing drugs

that tackle a particular problem.

I did mention where there was stuff that we did with Trend Micro when it comes to security,

when it comes to antivirus, and when it comes to sort of malware. There's also sort of packet inspection,

a lot of other things that can be done there as well.

I did touch upon this earlier. There was also something that we did with Cleveland Museum of Art.

That 1 is probably I think even though it is a very, very common use case, it's reverse image search, I think it's probably still my favorite.

I'd like to talk about this 1 quite a bit because

Cleveland Museum of Art does not have a very, very large engineering presence, but they were still able to create something very, very quickly within a week, this reverse image search system. This particular solution, I believe, is called AI ArtLens.

I believe that's the name. I do encourage everybody to go look it up.

When folks talk about

we are

democratizing this or democratizing that,

It can end up being a catchphrase

for

just taking whatever it is that you're building and saying, we are making this available to everybody.

But when,

really, push comes to shove, I think being able to take something that

traditionally has been very, very difficult to implement,

like reverse image search. If you look at maybe 5, 10 years ago,

it was pretty involved to build a reverse image search system. And on top of that, it was probably the performance might not have been that great either, especially, I wanna say, pre

pre ConvNets, before transformers, and before convolutional networks.

But, you know, today, I think when you see a combination

of the plethora of models that are out there,

great models as well for computer vision, for NLP, for audio, for video,

you see vector databases.

You see a vector database like Milvus

combining these 2 and being able to do all of these very, very unique

search solutions and these very unique rank recommendation systems.

I think that's really what is amazing to me about some of the infrastructure that we're building, you know, in conjunction with the greater Milvus community. Right?

That's really why I choose to focus on, you know, we have, like, a number of sort of enterprise users that have much bigger deployments of Milvus. So, like, Compass, Shutterstock, I believe they use Milvus internally. Those are 2 other examples.

You know, we have a great number of enterprise users that deploy and use Milvus internally, but

I always like to focus on the smaller deployments. I always like to focus on the ones that maybe require, like,

1 person week or maybe 1 person day to accomplish because

those to me are

that is

why I

love

the Milvus ecosystem so much.

You know, that's what really brought me to Zilliz. That's what really keeps me, you know, engaged with the Milvus community and with a lot of the folks that I speak to is the capability

of

you know, for the even these very, very tiny teams to do something incredible that they wouldn't have been able to, let's say, 5 or 10 years ago.

In your experience of working on Milvus and building around it,

at Zilliz and working with the community, what are some of the most interesting

or unexpected or challenging lessons that you've learned in the process?

As an open source community, you know, we always get a lot of folks who come and ask questions about Milvus, and they say, oh, you know, how do I tune this particular parameter to get better performance out of my database?

And that's great and all. But

I think

what we've really

seen in just the past year, I wanna say from, like, April, May, or June, that kind of time frame, is an uptick in the number of folks who are actually engaged in terms of, oh, you know, there's

some work that I wanna do. There's some particular development that I wanna do. I wanna help

contribute to the Milvus community as well.

And

when it comes to open source, when it comes to community engagement,

I really do believe that the best open source communities

have folks who

want to be

engaged

with the open source project, with the project itself,

not only as a user, but also as a contributor. We have definitely seen a massive uptick within the Milvus community even in the past year.

I think part of it is due to we have these other projects that are a part of the vector database ecosystem as well. I'll mention them very, very briefly here. We have Towhee, which

I actually sort of cocreated that open source project with some of the other folks at Zilliz.

And Towhee is all about doing the upstream

embedding generation.

It's much closer to machine learning. It's much closer to Python, a bit closer to data science as well.

I think through, you know, all the people that we've brought on with Towhee, you know, they've been able to take a look at Milvus and say, well, this is really awesome. And by combining Towhee with Milvus, maybe what took me 3 months, 10 years ago, what took me

3 weeks, maybe

3 years ago, I can do in 3 days

today using these 2 open source projects.

We also have these visualization tools as well as these management interfaces

called Feder and Attu, respectively.

Obviously, I encourage everybody to go take a look at those on zilliz.com.

And these,

by reaching out to a lot of the folks, you know, in data science, in, you know, front end engineering,

in

even back end engineering, folks who might not originally know about Milvus,

by having this

broader vector database ecosystem,

I feel like we've been able to really reach out to many, many more people than we would have been able to if we just had Milvus. Right, if we just had

something, if we just had the vector database component itself.

And seeing that kind of growth in the past year, not just for Milvus, but also for these other projects that I talked about, I think that has been, I think, 1 of the shining points, I think, in my time here and being a part of the community as well.

As you continue to

build and iterate on the Milvus project and build your cloud offering at Zilliz on top of it, what are some of the things you have planned for the near to medium term or any particular aspects of it that you're excited to dig into?

I won't go too much into Zilliz Cloud. Obviously, what we're looking to do is take what we have done in the open source community, take Milvus,

make that accessible to a lot more people who want a managed service.

So Zilliz Cloud, we will continue to add features, add functionality to that. We're gonna onboard, as I mentioned, the entire vector database ecosystem. We're gonna onboard Towhee to that.

I wanna say our goal is

around June or July of next year to be able to have this embedding generation platform

on Zilliz Cloud as well.

When it comes to Milvus itself, we are going to continue to improve

Milvus not only in terms of speed,

but also in terms of accessibility.

We are adding hybrid search, so the ability to search across both metadata as well as the embedding itself.
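One simple way to picture hybrid search is "filter on metadata, then rank the survivors by vector distance". The sketch below does exactly that in plain Python. It is only an assumption about the general idea, not a description of how Milvus implements or plans to implement the feature; the row layout, the `hybrid_search` name, and the `year` field are invented for the example.

```python
import math

def hybrid_search(rows, query, k, predicate):
    """Keep rows whose metadata passes `predicate`, then rank them by L2 distance."""
    survivors = [row for row in rows if predicate(row["meta"])]
    survivors.sort(key=lambda row: math.dist(row["vector"], query))
    return [row["id"] for row in survivors[:k]]

# Hypothetical rows: an ID, an embedding, and some scalar metadata.
rows = [
    {"id": 1, "vector": [0.0, 0.0], "meta": {"year": 2019}},
    {"id": 2, "vector": [0.1, 0.1], "meta": {"year": 2022}},
    {"id": 3, "vector": [0.5, 0.5], "meta": {"year": 2022}},
    {"id": 4, "vector": [0.2, 0.2], "meta": {"year": 2021}},
]

# Nearest neighbors of the origin, restricted to rows from 2021 or later.
result = hybrid_search(rows, query=[0.0, 0.0], k=2, predicate=lambda meta: meta["year"] >= 2021)
print(result)  # [2, 4]
```

Row 1 is the closest vector overall, but it is excluded because its metadata fails the filter.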

There's going to be GPU support and perhaps even,

you know, even further down the road, some accelerator support as well, FPGAs,

TPUs, NPUs.

My background is actually I was an electrical engineering major in college.

So these accelerators are very near and dear to my heart,

where,

you know, in addition to hybrid search, I do wanna say that, you know, probably we're going to be looking at some point at some more advanced functionality as well.

There's quite a few things that we

want to do that we really haven't gotten to yet that we'll try to be pushing out with Milvus 2.2, 2.3, and beyond. Right? Milvus 3.0, 4.0,

so on and so forth.

All these features, we obviously plan to have on our cloud platform as well, on Zilliz Cloud, to really make that accessible to as many people and as many organizations as we can.

Are there any other aspects of the Milvus project or the overall

capabilities of vector databases that we didn't discuss yet that you'd like to cover before we close out the show?

I do wanna speak a little bit more about the broader vector database ecosystem. I did touch upon this earlier.

I wanna introduce

some of these other open source projects that we have in a little bit more detail.

Towhee is meant to be this upstream embedding generation platform.

We

refer to it internally within Zilliz as a new ETL

for unstructured data,

where you see, traditionally,

a lot of these ETL platforms have always been about

doing data loading, doing this data transformation, and perhaps

creating data as an input into machine learning models. But as machine learning and as these embedding generation

methods become

more and more sort of production ready, more and more solid,

you see them be a part of the ETL pipeline as well. And that's really what Towhee aims to do there. We have Attu, which is essentially a management platform for your vector data. It is a management front end where you can actually go and very, very easily,

let's say, take a look at some of the internal workings of Milvus

and Feder, which ties into that as well. It is an ANN search index visualization tool. I think anybody who goes and takes a look at it will see that it's really, really awesome. We can actually visualize a lot of these ANN algorithms in JavaScript. So you can see how HNSW,

how it works in terms of the different layers,

in terms of algorithmically, okay, perhaps this is how this approximate nearest neighbor search is being done with a single query vector. So that's sort of a bit more of an introduction to the broader vector database ecosystem that we have been

developing at Zilliz in conjunction with Milvus.

Obviously, Milvus still forms the core

of our engineering efforts in conjunction with the broader

Milvus community.

But I definitely do wanna bring these up. And I think these will continue to form a core part of what Zilliz is trying to do

and really trying to build these tools and build the ecosystem

around Milvus as well.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work

that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

I think coming from a machine learning perspective,

I think

the intersection of ML and data, and it really goes both ways. So sort of

DB for ML and ML for DBs, I think there's a lot of work that can be done there. There's a lot of work that can be done in both of these areas. These are 2 very unique areas, mind you. Vector databases

and, you know, other

platforms like feature stores, so on and so forth, these are very much in the

DB for AI/ML

area.

And then you see a lot of work being done, I wanna say, primarily in academia these days where there is ML or AI for DBs as well.

Maybe to, let's say, speed up these. And there might be sort of more work that's being done out there, but maybe, you know, if I have a particular query and I can use ML to, let's say, figure out maybe how long that query is gonna take to execute. This is just 1 particular example. Right? So,

the intersection of

data

and machine learning/AI, I think there's really quite a bit of work that can be done there, both in terms of infrastructure as well as in terms of algorithms.

Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on Milvus and Zilliz. It's definitely a very interesting project and interesting

problem domain. It's great to see an open source offering in the vector database space. So I appreciate all the time and energy that you and the other members of the Milvus team are putting into that, and I hope you enjoy the rest of your day. Thank you, Tobias.

Thank you for listening.

Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast,

which helps you go from idea to production with machine learning.

Visit the site at dataengineeringpodcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
