Summary
In this episode of the Data Engineering Podcast, Tobias Macey interviews Kacper Łukawski from Qdrant about integrating MCP servers with vector databases to process unstructured data. Kacper shares his experience in data engineering, from building big data pipelines in the automotive industry to leveraging large language models (LLMs) for transforming unstructured datasets into valuable assets. He discusses the challenges of building data pipelines for unstructured data and how vector databases facilitate semantic search and retrieval-augmented generation (RAG) applications. Kacper delves into the intricacies of vector storage and search, including metadata and contextual elements, and explores the evolution of vector engines beyond RAG to applications like semantic search and anomaly detection. The conversation covers the role of Model Context Protocol (MCP) servers in simplifying data integration and retrieval processes, highlighting the need for experimentation and evaluation when adopting LLMs, and offering practical advice on optimizing vector search costs and fine-tuning embedding models for improved search quality.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Kacper Łukawski about how MCP servers can be paired with vector databases to streamline processing of unstructured data
- Introduction
- How did you get involved in the area of data management?
- LLMs are enabling the derivation of useful data assets from unstructured sources. What are the challenges that teams face in building the pipelines to support that work?
- How has the role of vector engines grown or evolved in the past ~2 years as LLMs have gained broader adoption?
- Beyond its role as a store of context for agents, RAG, etc., what other applications are common for vector databases?
- In the ecosystem of vector engines, what are the distinctive elements of Qdrant?
- How has the MCP specification simplified the work of processing unstructured data?
- Can you describe the toolchain and workflow involved in building a data pipeline that leverages an MCP for generating embeddings?
- helping data engineers gain confidence in non-deterministic workflows
- bringing application/ML/data teams into collaboration for determining the impact of e.g. chunking strategies, embedding model selection, etc.
- What are the most interesting, innovative, or unexpected ways that you have seen MCP and Qdrant used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on vector use cases?
- When is MCP and/or Qdrant the wrong choice?
- What do you have planned for the future of MCP with Qdrant?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- Qdrant
- Kafka
- Apache Oozie
- Named Entity Recognition
- GraphRAG
- pgvector
- Elasticsearch
- Apache Lucene
- OpenSearch
- BM25
- Semantic Search
- MCP == Model Context Protocol
- Anthropic Contextualized Chunking
- Cohere
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. Your host is Tobias Macey, and today I'm interviewing Kacper Łukawski about how MCP servers can be paired with vector databases to streamline processing of unstructured data. So, Kacper, can you start by introducing yourself?
[00:00:59] Kacper Łukawski:
Of course. Hello. My name is Kacper Łukawski, and I'm a senior developer advocate at Qdrant. We are building a vector database that supports many applications related to large language models, but not only. And retrieval-augmented generation is probably the most typical use case nowadays.
[00:01:18] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:21] Kacper Łukawski:
Yeah. Of course. I have a software engineering background. And in one of my previous jobs, I used to work as a software developer, and we started building big data pipelines at that point. That was probably around 2014 or '15. And we built a couple of projects in the automotive industry which were using Spark and all the Apache tools that were popular back then, like Kafka, Oozie, and many, many more. And that was a natural transition for me to start building these kinds of solutions, including not only data ingestion, but also data visualization and business intelligence.
[00:02:03] Tobias Macey:
And so now digging into the current frenzy around the applications of data, how to make use of it, how to use it to power these various AI applications, obviously, large language models have had a drastic impact on the utility and applications of unstructured datasets, which have largely been shoved off to the side and used for bespoke purposes or as training corpora for natural language processing tasks. But with the capabilities and the scale that large language models offer, we can now turn those into usable assets for various applications, whether that's business analytics or, more generally, language model applications.
And I'm wondering if you can just start by talking through some of the challenges that you're seeing teams face in building the pipelines that are necessary to be able to take that corpus of unstructured data and turn it into usable data assets.
[00:03:06] Kacper Łukawski:
Yes. Of course, there's been a massive impact on how we build data pipelines for unstructured data if we use LLMs. And I feel like one of the things that we still forget is that language models are not going to magically solve all the problems with our data, and they do not have any capabilities to fix it in any way. And, from my experience, there are many teams struggling with bringing this data in because they still don't understand the nature of language models, which might be making errors. It's not like a typical application where we write code and we can test it thoroughly. With LLMs, things are a little bit different, because we can figure out a way to process data using these models and then face some issues, because this is not going to work on all the cases that we have. And quite a typical enterprise case is that people have lots of scanned documents or PDFs, and they want to bring them somehow into their applications.
And there are various ways of how to do that. The selection of a proper large language model is key here, or a visual language model, because we want to interpret images. But still, there are challenges related to scalability and to the deployment of these models, especially if we work in an industry that can't just use proprietary SaaS-based tools. Then the teams start to struggle with setting all the pieces up.
[00:04:34] Tobias Macey:
And in terms of the destination of those unstructured assets into some usable data assets, what is the typical shape that you're seeing teams use as that destination point for unstructured sources, whether that's turning it into tabular data, extracting numerical data, or potentially turning it into graph representations using something like named entity recognition? And I'm wondering what are some of the common applications that you're seeing teams use those LLMs for in terms of that transformation?
[00:05:17] Kacper Łukawski:
Yes. So definitely GraphRAG is becoming popular. For example, we have just finished a case study with one of our users, and they built a pretty interesting system that was using LLMs to derive ontologies given some unstructured data. That was applied to some restricted domains like law and medicine. And they were actually building a pretty interesting system that was able to understand the relationships in the data, and they used a dual modeling approach. So not only do they have vector embeddings used for capturing the semantics of the data, but also graphs to capture the relationships between different entries. And I feel like this is kind of a typical scenario nowadays. Like, everyone is speaking about GraphRAG. But when it comes to all the other destinations, yeah, actually, LLMs simplified a lot of problems that we were dealing with in the past, like named entity recognition, text classification, and translation as well. So there are various applications. Obviously, we had some other methods in the past, like algorithms trained solely for a specific problem. Right now, LLMs became the de facto standard for solving all of these problems at the same time. And yet, in my experience, vector databases are typically the destination for all the data they process, because our users typically want to build some sort of search system that will power their agents or maybe just the search bar on the website. But this is actually the most interesting case for our users: how to finally start deriving some insights from the data that was not searchable in the past. So, yeah, I would say this is the most important application.
[00:07:01] Tobias Macey:
And then in terms of the modeling that is involved in the vector storage of these systems, obviously, vector databases as a category have seen massive growth in terms of their adoption and attention over the past two years because of the introduction of LLMs and the use cases around embeddings and RAG. But there are also vector extensions to other styles of database engines, one of the more popular ones being pgvector. And so then there's the consideration about what are the additional metadata fields or contextual elements that you want to collocate with these embeddings. And so, obviously, Qdrant is more of a document store.
pgvector is an extension to Postgres, which sits alongside relational data. And I'm wondering how you're seeing teams think about the design elements of what the broader context and utility is that they want to get out of the storage medium, beyond just the ability to have some means of storing these n-dimensional arrays?
[00:08:09] Kacper Łukawski:
Yeah. So first of all, I would distinguish databases from search engines, because that might be kind of confusing, and I wouldn't call Qdrant a document store. It's more like a search engine. So if we were looking for an analogy here, it's more like the Elasticsearch that we used for keyword or lexical-based search in the past. And this is just an alternative to that paradigm that can also capture the semantics through the vector embeddings. And, yeah, quite a typical way of storing these vectors is to also keep the input data that was used to create them somehow. And the idea is that if we speak about text, then we usually just put the particular chunk that was used to create a vector inside the metadata, because in Qdrant, every single point can have even multiple vectors and some sort of JSON-like metadata.
And this metadata can actually contain the original data that can be used to reproduce the process of creating the embedding. But in many cases, we just keep a reference to the original data, which is stored somewhere else. That might be your relational database, or sometimes a URL to a file, because vector search is actually not only about text; it can also make search over images, video, or audio data possible, which was impossible in the past. So, yeah, just to sum it up: the typical approach is to keep the original data close to your vectors, because obviously you will need it, as these vectors do not preserve all of the original information. And the typical approach is to use the vector for search, and then take the raw data and pass it somewhere else, like to the LLM if you build RAG, or maybe just expose the original documents to the user if you build plain search.
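To make the pattern concrete, here is a simplified sketch of the data layout Kacper describes: each point pairs an embedding with JSON-like metadata (the "payload") that carries the original chunk or a reference to it. This is an illustration only, not the actual qdrant-client API; all ids, vectors, and payload fields are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Point:
    """One searchable point: an embedding plus JSON-like metadata."""
    id: int
    vector: list[float]  # the embedding used for similarity search
    payload: dict = field(default_factory=dict)  # chunk text or a reference

points = [
    # keep the chunk text itself in the payload...
    Point(1, [0.12, 0.85, 0.33], {"text": "Qdrant is a vector search engine."}),
    # ...or just a reference to data stored elsewhere (file, relational row)
    Point(2, [0.91, 0.10, 0.42], {"source_url": "s3://docs/report.pdf", "page": 3}),
]

def retrieve_payloads(ids: list[int]) -> list[dict]:
    """After vector search returns matching ids, hand back the raw data."""
    by_id = {p.id: p.payload for p in points}
    return [by_id[i] for i in ids]

print(retrieve_payloads([2, 1]))
```

The payloads returned here are what would be passed downstream, to an LLM in a RAG pipeline or straight to the user in a plain search UI.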
[00:09:58] Tobias Macey:
And then once you do have your data in this embedded format, you have the vectors, you have some reference to the original data, then there's the challenge of figuring out what is the retrieval method that you want to use, where vector indexes play a substantial role. But to your point, with Qdrant being a search engine as well, I know it supports the BM25 index versus the Lucene index that Elasticsearch relies on. And so for people who already have some investment in search infrastructure, as they're starting to do that analysis of how they want to store their vector embeddings, whether they use the capabilities in their Elasticsearch or OpenSearch clusters versus migrating at least the vector pieces to Qdrant, or migrating all of their search infrastructure to Qdrant, how do they manage that evaluation, and what are the key decision factors that go into it?
[00:10:52] Kacper Łukawski:
Yeah. Obviously, that's a common question. And, well, there are many companies that invested a lot of resources and time into these kinds of search engines, and they eventually also added vector search capabilities. But the thing is that vector databases as a category were built solely to support dense embedding vector search, and they are just way more efficient, because the data structures that we use for dense vector search are totally different from the ones that we use for lexical search based on, for example, BM25. So there are different ways of how you can incorporate vector search into your existing retrieval pipeline. For example, you could obviously use the existing system and then just add these capabilities.
And there are some differences in how we implemented this vector search compared to these traditional systems, so to say. I remember in the past they were just not scaling that well, because they just had a single segment. So that was totally fine to use unless you were already dealing with millions of embeddings, because then you would have to store it all on a single machine, and the cost of running that would be just enormous. And all the vector databases were actually built with efficiency in mind. So we implemented that as the first functionality, and BM25, or sparse vectors, are just an addition to that; vector search is still the key component. And if you have a running search system and you would like to experiment with vector search, it's relatively easy to just build a service that will be connecting to both of the systems. So you will have the existing system that will be doing the lexical search with all the heuristics involved, and then you will have the most efficient vector search engine that will support the semantics, because retrieval is actually a very hard problem.
And sometimes you would be handling your queries in a completely different way depending on their length or their semantics. LLMs are also quite often being used for that. So LLMs might be used to classify the input to your search engine, and then they might also be used to make a decision about how we want to handle it. Traditional lexical search is just way better if we deal with lots of proper names or some identifiers of your products, when the exact match matters. And on the other hand, if you have questions which are long-tail queries, with lots of words, then it's better to just default to vector search, because that should handle this type of query way better.
So you will be building a service around your search stack. You won't be calling the search engine directly; there will usually be some preprocessing done. And in that case, you can just add another component to make sure you have the best system to support a specific scenario. You also mentioned pgvector, the extension to Postgres, which is also quite popular when it comes to vector search. Yeah, I know it's popular, and many people are happy with using it, but the thing is that the database is not the best place to put an additional workload related to search. If you have just thousands of documents, then that might be fine, because you won't even notice the difference. However, vector indexes are pretty resource intensive. They require lots of memory. And if you add this additional load on an existing database, like a relational database, then at some point you will see that the system starts to struggle with performance, because the majority of the resources will be consumed by the search.
Relational databases are designed to collect and store the transactional data, the data which is key for the business, like your customers, your products, and the transactions made between customers and products. Search engines are more about supporting the system, and they are not really the most critical part of the application. So that's why we use different systems to support search, and the same applies here. If you want to add vector search capabilities, then it's better to have a separate system that will be doing that effectively, efficiently, and without generating this additional load on your existing components.
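The query routing Kacper describes, exact-match-style queries going to lexical search while long natural-language queries default to vector search, can be sketched with a toy heuristic. The word-count threshold and identifier pattern here are hypothetical illustration values, not anything Qdrant prescribes:

```python
import re

# Matches identifier-like tokens such as "SKU-1234" or "RFC2616",
# where exact lexical matching beats semantic similarity.
ID_PATTERN = re.compile(r"\b[A-Z]{2,}-?\d+\b")

def route_query(query: str) -> str:
    """Decide which backend should handle a query (toy heuristic)."""
    words = query.split()
    if ID_PATTERN.search(query) or len(words) <= 3:
        return "lexical"  # product codes, proper names, short keyword queries
    return "vector"       # long-tail natural-language queries

print(route_query("SKU-1234 charger"))
print(route_query("what cable do I need to charge my laptop on a plane"))
```

In a real deployment this decision often sits in the preprocessing service mentioned above, and may itself be made by a small LLM classifier rather than a regex.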
[00:15:24] Tobias Macey:
And then another interesting aspect of these vector engines for search capabilities is that they can be used for more than just the RAG and context retrieval use cases that LLMs are making more popular. Their original formulation was actually as more of a semantic search capability. And so I'm wondering how you're seeing teams leverage Qdrant, and vector engines more broadly, beyond just the hyped-up use cases around RAG and agentic capabilities?
[00:15:59] Kacper Łukawski:
Sure. Yeah. You're absolutely right. When I joined Qdrant in 2022, LLMs were not that broadly adopted yet. ChatGPT was introduced a few months later, and that changed the game completely. Right now, I would say 80% of the use cases that we see are somehow related to RAG or agents. But our early adopters started implementing semantic search as an alternative to keyword-based search just because they saw it could handle some different scenarios that couldn't be solved with traditional means.
And for example, ecommerce had adopted semantic search even before the introduction of the language models. Imagine you were running an ecommerce business and you had lots of products with titles and descriptions, but they were all in English. And you also wanted to serve people who couldn't speak that language, and semantic search with multilingual embeddings became a pretty easy way to enable them to still use the system even though they do not speak the same language. On the other hand, there are also lots of people who can't really express their intent using just keywords, and you just don't want to ignore a huge part of your audience because they can't properly put together the keywords to find what they need. And in ecommerce, that's a key aspect. You want your users to be able to find what they want to buy. And semantic search enabled a different approach to search, and that was broadly adopted. But right now, we also see lots of people use vector search for different scenarios.
Maybe let me just briefly speak about the basics of vector search. In vector search, we have an embedding model, and this embedding model will be trained on some sort of data modality, like text, images, or videos, for example. And text is just the most typical problem that we solve. So if we have this model, it can take virtually any text and convert it into a fixed-dimensional vector, which is just a list of floats, some numbers. And these vectors have a useful property: if two different vectors describe a similar object or sample, they should be close to each other in that vector space, and we measure the similarity using, for example, cosine similarity.
So if the similarity is high, we assume that those observations are related to each other somehow. Okay. And since we have a similarity measure, like cosine similarity, we can also try to use that measure to detect samples which are out of the distribution of the data we want to support. And vector search is nothing new. This is actually the good old k-nearest neighbors algorithm, just approximated. And this algorithm is pretty versatile. It might be used for anomaly detection, for example, in the medical domain. Also, if you just want to experiment, you can see what the clusters in your data are. Or you could perform a classification just by a simple voting procedure, because you can select the most similar examples, which should be labeled already.
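The property described above can be shown with a few lines of code: embeddings are fixed-dimensional lists of floats, and cosine similarity scores how related two of them are. The vectors here are tiny toy values, not real model output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# toy "embeddings" for three pieces of text
cat = [0.9, 0.1, 0.2]
kitten = [0.85, 0.15, 0.25]
invoice = [0.1, 0.9, 0.7]

print(cosine_similarity(cat, kitten))   # close to 1.0: related concepts
print(cosine_similarity(cat, invoice))  # much lower: unrelated
```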
So it solves classification and regression at the same time. This is actually a pretty good method to solve multiple different problems that were typically solved with separate means. And, obviously, that all might be combined with large language models too. Like, if you have a system that accepts natural-language queries from users and you see there is a new query sent to the application, and this query is pretty far away from all the past observations you had, that should also trigger a human in the loop, because maybe somebody is trying to perform a prompt injection attack and trying to use your LLM credits to maybe create some structured output and attack the system in that way. And if you already calculate the embeddings, you can also try to perform this anomaly detection using the same vectors. So this is a really versatile method, and we see many people use it not only for RAG or agentic purposes, but also to support these traditional problems that we had in machine learning in the past, right now just with vectors.
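A toy version of the k-nearest-neighbors voting procedure mentioned above: classify a new embedding by majority vote among its most similar labeled examples, and flag it as an anomaly when even the best match is far away. The vectors, labels, and similarity threshold are hypothetical illustration values:

```python
import math
from collections import Counter

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# already-labeled examples (toy embeddings of support tickets)
labeled = [
    ([0.9, 0.1, 0.1], "billing"),
    ([0.8, 0.2, 0.1], "billing"),
    ([0.1, 0.9, 0.2], "shipping"),
    ([0.2, 0.8, 0.1], "shipping"),
]

def classify(query: list[float], k: int = 3, anomaly_threshold: float = 0.5) -> str:
    """k-NN voting with an out-of-distribution check."""
    ranked = sorted(labeled, key=lambda item: cosine(query, item[0]), reverse=True)
    top = ranked[:k]
    if cosine(query, top[0][0]) < anomaly_threshold:
        return "anomaly"  # even the closest example is dissimilar: human in the loop
    votes = Counter(label for _, label in top)
    return votes.most_common(1)[0][0]

print(classify([0.85, 0.15, 0.1]))  # near the billing cluster
print(classify([0.0, 0.05, 1.0]))   # far from every labeled example
```

Production systems use the same idea with approximate nearest-neighbor indexes instead of the brute-force scan shown here.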
[00:20:36] Tobias Macey:
Now in terms of the work to be done as far as integrating the vector engine into a broader system context, and managing the data loading and transformation for creating the embeddings, maintaining them, and managing the retrieval, obviously, there's a lot of work that has gone into that, and I know that one of the areas of focus that you have is in building and maintaining the Model Context Protocol server for Qdrant. And I'm wondering if you can talk to some of the ways that the introduction of MCP as a protocol has simplified the work to be done, or how it is integrated into that overall flow of information, from managing your unstructured sources, loading them into Qdrant, and then also managing it for retrieval? And I know it also simplifies the work of using Qdrant as a store for contextual memory.
[00:21:31] Kacper Łukawski:
Yeah. Sure. So, surprisingly, I wouldn't recommend using the MCP server just to ingest the data into the system. There are many other tools that might be used for that, with Airflow probably being my preferred one, just because MCP servers are supposed to act as some sort of plugins so you can connect different tools to your LLMs. And it doesn't make much sense; you have more possibilities if you just use the SDK directly. Vector databases have a lot of optimizations that might be used to make ingestion cheaper, faster, or just more accurate. And in that case, if you just use the MCP server, you are losing this ability, or you still need to use both, so it doesn't make much sense. There are plenty of other tools that you can use out of the box, like Airflow, or some cloud tools if you prefer. However, MCP has enabled people to use the data ingested by your pipelines in various applications.
So there might be a very heavy ingestion pipeline that will be taking lots of unstructured data and sending it to Qdrant. And on the other hand, there will be lots of people connecting to the same Qdrant instance and using the ingested data for their own purposes. Because Qdrant here acts as a knowledge base, and this knowledge base may contain different things. That might be some sort of data which is specific to your business, or it might be a set of code snippets coming from different projects you created within your organization, depending on how you would like to use it. And MCP servers, at least our MCP server, are not designed to support the ingestion part. Of course, there is such a tool. You can use that tool if you, let's say, connect your MCP server with Claude Desktop and you would like to build a personal knowledge base that will just store all your memories.
But I strongly encourage you to use a different tool for the ingestion pipeline. This is just not the best way to integrate with all these tools, and that applies not only to our server, but also to many other ones. Surprisingly, we also modified our existing MCP server in a way that allows you to run it in read-only mode, so the LLM won't even be able to store anything inside your Qdrant collection. And I feel like that should be the preferred way of using it, except for some very niche use cases.
[00:24:04] Tobias Macey:
So in terms of the role of the MCP server in the overall system architecture, and the ways that data engineers should be thinking about the context management within Qdrant, what are some of the best practices that you have found around how to structure the data in a way that is conducive to the retrieval elements? Obviously, chunking is more of an art than a science currently. Embedding models are generally going to give you different results based on the particular domain that you're in. And I'm wondering how you're seeing teams think about that end-to-end flow, and some of the ways that the MCP server maybe reduces the barrier to entry for the consumer side to help with some of the evaluation work to be done?
[00:24:54] Kacper Łukawski:
Yes. That's a great question. So definitely, chunking is one of the problems we need to solve once we decide to implement vector search. And there are no easy answers here. Obviously, the default settings in all the popular frameworks are not the best ones. I mean, you can try to use the simplest means possible, like a set window size, and divide your documents into chunks of that size with some overlap between them so you do not lose the context. But in many cases, context is not only about a particular piece of text; it might be derived from, let's say, the documents we work with, like PDFs. They have some formatting, like headings, and adding these headings to your chunks usually helps to understand the overall context of that particular piece. And, on the other hand, if you work with code and you would like to create a system that would be searching over your code base, then a particular loop with some function calls does not say that much about the role of that particular piece of code in the whole application.
So in that case, it's better to keep the name of the class this particular method belongs to, the name of the method itself, and the parameters that are passed to it, along with the actual body of this particular method. And including the documentation strings can also help with that. I really like the piece from Anthropic, actually an article, not a paper, about contextualized chunking. They presented a pretty interesting idea of how to use an LLM to summarize the role of a particular piece of the document in the context of the whole document.
So LLMs are actually pretty good for that, especially if we deal with traditional PDF-like documents, because they can easily summarize the documents. And then, if we have a summary of the whole document, we can also clearly state the role of this particular chunk that we create. But, yeah, there are no easy answers, as I said, and that really depends on the data you have. In many cases, chunking is just an experiment that you need to conduct, and evaluation is key here. But the good news is that evaluation in information retrieval is nothing new. And if you are able to test your search pipelines, and I assume you do if you think about it seriously,
then you can use the same means to evaluate the quality of retrieval in the case of vector search, because you use the same metrics, the same tools, and nothing is new except for the paradigm that you use to search.
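The simplest strategy described above, a fixed window with overlap so context is not lost at chunk boundaries, can be sketched in a few lines. The window and overlap sizes are arbitrary illustration values; real pipelines tune them, often chunk by tokens rather than characters, and may prepend headings or LLM-generated context to each chunk:

```python
def chunk_text(text: str, window: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the
    last `overlap` characters of the previous one."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    return [text[i:i + window] for i in range(0, len(text), step)]

doc = "Vector search retrieves the nearest embeddings to a query vector."
for chunk in chunk_text(doc):
    print(repr(chunk))
```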
[00:27:40] Tobias Macey:
One of the interesting aspects of the current era that we find ourselves in is that software and data engineering for the past few decades has been very deterministic, based on standard practices that are provable using testing or theorem provers and that have standardized architectures. And it seems that the introduction of large language models has generalized the need for the adoption of, and expertise in, experimentation that data scientists have been dealing with for a long time now. And I'm wondering how you're seeing data teams in particular come to grips with that aspect of the work to be done, where they need to be more in the loop of experimentation, tracking the results of those experiments, being able to iterate quickly on versioning their different chunking strategies or deploying different embedding models, and managing the re-embeddings, because we're not at a point anymore where we can say, okay, this piece of the overall workflow is done, because the workflow is constantly changing as new capabilities and new models come out. And I'm just wondering how you're seeing that dynamic play out in data teams in particular.
[00:28:55] Kacper Łukawski:
Yeah. Evaluation is key. Definitely, data teams should be focusing on evaluating multiple pieces here. For sure, choosing the embedding model is an important component of that. I'm not a big fan of just testing out all the embedding models that exist, carefully watching the newly released models, and trying to use them. There are different ways of how you can improve the quality of search than just taking the best model that exists, according to the benchmarks, and trying to re-ingest all the documents you have. Let's be honest with that. If you have millions of documents, then creating the vectors for all these documents will take you a lot of time. Even if you can just scale up your environment, that would be expensive if you just take the biggest model that exists. And evaluation is key. Sometimes we can sacrifice some of the precision just for the sake of effectiveness and efficiency.
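To make the cost argument concrete, here is a back-of-envelope sketch. Every number in it (corpus size, tokens per document, API price) is an illustrative assumption, not a real quote from any provider:

```python
# Back-of-envelope estimate of what re-embedding a whole corpus costs.
# All numbers below are illustrative assumptions, not real prices.
num_docs = 5_000_000
avg_tokens_per_doc = 800
price_per_million_tokens = 0.10  # hypothetical USD rate for a hosted embedding API

total_tokens = num_docs * avg_tokens_per_doc
cost_usd = total_tokens / 1_000_000 * price_per_million_tokens
print(f"{total_tokens:,} tokens -> ${cost_usd:,.0f} to re-embed the whole corpus")
```

The dollar figure scales linearly with model size and price tier, and the wall-clock time scales with model size too, which is why swapping to the newest, biggest model is rarely a casual decision.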
So what I usually try to convince our users of is that they should just take the smallest model that does the job at an acceptable level. And I'm a big fan of using the small models. They are easy to fine-tune, and there are plenty of ways that you can increase the quality of search just after creating the embeddings. But, yeah, let me just summarize what the aspects are that we need to evaluate when we build retrieval-augmented generation or anything related to LLMs that has this vector search component. So, basically, choosing the embedding model, and choosing it based on some internal benchmarks, not the public benchmarks that we can easily find, like MTEB. MTEB is the most popular benchmark that presents the quality of retrieval for different embedding models. And the thing is that some companies can easily just train the models so they shine in these benchmarks, but none of these benchmarks will have the data that might be specific to your own business. And in that case, doing this evaluation on your own is key. But on the other hand, if we use LLMs in the whole pipeline, then we also need to evaluate the quality of the outputs of the LLMs, so there are just two models to evaluate separately.
Like, if we are confident in our retrieval phase, we are sure that the embedding model does its job. Then we also need to evaluate the quality of the outputs of the LLM. If it's struggling with answering the question, even though the context provided is just fine to answer it, then choosing a different one might be important. And this is a challenge, actually. Like, evaluating the retrieval is easy compared to evaluating the LLMs. There are various metrics, but there is no consensus yet on how to do it properly. And we typically just take a better LLM to evaluate the worse or the smaller one that we want to use. Or if we can use a SaaS tool, then we use it to evaluate the quality of an on-premise LLM.
So this is also a challenge. But, yeah, many experiments have to be done if you really want to build a system that will be able to work with your data, and definitely, it's not gonna be deterministic. I mean, LLMs might be working fine in 99% of the cases, but at some point, you also need to make sure that you are able to trace back all the errors that may occur and monitor the use of the LLMs in some way. Observability here is a really important component that many teams forget about.
[00:32:33] Tobias Macey:
In terms of the evolution of these systems, particularly when you're dealing with the embeddings as a corpus of context for a RAG model or an agentic use case, as new embedding models come out and as you change chunking strategies, you can potentially balloon the overall storage size of the vectors that you're storing. And I'm wondering how you're seeing teams manage the life cycle of those embeddings: understanding which ones are being used, when they can age out older embeddings, and just some of that overall management of the evolution of these systems without necessarily just letting costs grow unbounded.
[00:33:17] Kacper Łukawski:
Yeah. That's definitely important, but it's not something that you would get for free. Like, it's not built into any vector database that I know. It's basically something that you need to monitor on your own. But the thing is that if you have an application already running in production, then you rarely just experiment with new embeddings in that production environment. There is typically a process of evaluating them offline that you do on just a fraction of your data, on some ground-truth dataset that you are able to build. And then once you have proven that the new embedding model, for example, works well in that scenario, then you can just swap it in for the old one. But, yeah, the important thing is that the cost of running vector search might be high, because it is so memory intensive. That might be mitigated by different means, like various ways of optimizing the storage so it doesn't cost you a fortune to run semantic search. But in general, you rarely just experiment with dozens of different embedding models in production, because this is just too expensive to do. And if you have a production system that you would like to improve the quality of, you can experiment with some other techniques.
Like, for example, instead of just retrieving the context using this single vector embedding, you can also try to use a re-ranker. Just take some more candidates in this initial retrieval phase and then try to re-rank them so the better results pop up at the top of the results. And that's a pretty easy way that doesn't require you to change the structure of your collections in the vector database. And, yeah, there are various means that you can use in order to make it better. But, yeah, these experimentations are not done in a living system; it's more like research that has to be done beforehand.
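The oversample-then-rerank pattern described here can be sketched as follows; `vector_search` and `rerank_score` are hypothetical stand-ins (in practice the first would be an approximate-nearest-neighbor query and the second a cross-encoder or similar model):

```python
# Sketch of the oversample-then-rerank pattern: fetch more candidates with
# cheap vector search, then reorder them with a more expensive scorer.
# Both functions below are illustrative stand-ins, not a real API.

def vector_search(query, limit):
    # Stand-in for a real ANN query; returns (doc_id, text) candidates.
    corpus = {
        "a": "how to configure quantization in a vector database",
        "b": "recipe for sourdough bread",
        "c": "tuning vector search memory usage",
    }
    return list(corpus.items())[:limit]

def rerank_score(query, text):
    # Stand-in for a cross-encoder; here, simple token overlap with the query.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q)

def search_with_rerank(query, final_k=2, oversample=3):
    # Retrieve final_k * oversample candidates, then keep the best final_k.
    candidates = vector_search(query, limit=final_k * oversample)
    ranked = sorted(candidates, key=lambda c: rerank_score(query, c[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:final_k]]

print(search_with_rerank("vector search memory"))  # ['c', 'a']
```

Note that nothing about the stored collection changes; only the query path gains an extra scoring pass, which is why this is a cheap experiment to run.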
And changing the embedding model is actually an important decision. And yet not that many teams do that very often.
[00:35:26] Tobias Macey:
As you have been building and iterating on the Model Context Protocol server for Qdrant, working with the Qdrant community, and coming to grips with the constant evolution in the space, what are some of the most interesting or innovative or unexpected ways that you're seeing the combination of MCP and Qdrant applied?
[00:35:47] Kacper Łukawski:
Well, I wouldn't say it's that surprising, but when people got excited about vibe coding, like, everybody started to build applications over the weekend, and the MCP servers might be pretty useful to perform something that I like to call grounded vibe coding. I mean, if you work in a company that has lots of different projects and you wanna keep up the standards, then you definitely just don't want to vibe code an application and let the LLM do whatever it wants, but you want to have, like, a knowledge base that will understand all the projects that you have. It may also contain, like, code snippets so you can reuse them, or maybe you have a very specific frontend framework that you use in order to keep all the applications you create consistent.
And actually that became a pretty standard way of using our MCP server. People use tools such as Cursor, Windsurf, Augment Code, or even VS Code nowadays, and combine them with Qdrant, which acts as a knowledge base for this grounded vibe coding. So they put all the code snippets, for example, here. And then if they ask their agent, the coding assistant, to create something brand new, it will start by trying to find some similar code snippets from different projects. So maybe it's not necessary to create it from scratch, but maybe we can just use some of the components. Or maybe it can just point out that there might be, like, an internal library within the company that you can use for that specific purpose.
So you don't reinvent the wheel. And that's not that surprising, but it's a pretty interesting area that we feel will just keep growing, because that's what the majority of our users are trying to do now. And surprisingly, I didn't really anticipate this huge success of the MCP. I thought it would only be adopted by Anthropic, even though we released this MCP server just a few days after it was announced. And then we were just astonished by how many people started to use it for different things, and by the fact that OpenAI and Google also started to incorporate this protocol in their products. And that was actually the moment we realized this is more important than we thought.
[00:38:08] Tobias Macey:
And in your work of helping to build and guide and interact with the community around Qdrant and experimenting with these various vector use cases, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:38:26] Kacper Łukawski:
I think many people still struggle with the cost of running vector search. And, surprisingly, not that many know that there are plenty of options for how to optimize it. I mean, methods such as quantization can help reduce the cost of running vector search greatly. Like, one surprising thing is that if you use a relatively high-dimensional model, let's say over a thousand dimensions, which is not that rare nowadays, then your model will typically be compatible with binary quantization. So you can reduce the memory footprint by up to 32 times and make it faster by up to 40 times. And, yeah, many people that started building vector search were just dealing with these problems. And when they enabled binary quantization on their collections, on their vectors, things started to behave way better. So this is a common challenge that I see in the community. And there's also not that much understanding of what vector search really is. I mean, people really think that vector search will magically solve their problems and just enable search over any data that they have. And, unfortunately, it's not that easy. Like, you really need to have a model that supports your data, and you can't just take the simplest one that's available out there. You really need to choose it wisely. And since I'm based in Poland and I have spoken to so many companies here in Poland, I realized there are just many people trying to use the default models that come with LangChain or any other framework that exists in that space. For example, they take OpenAI embeddings, because they were the default ones for a very long time, and try to use them for non-English data. That's okay in some cases. I think that works pretty well with German, but it's just not documented anywhere. There are just various options. Like, there are various model providers whose models officially support multiple languages at the same time.
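The 32x figure follows directly from the arithmetic: a float32 dimension takes 32 bits, while binary quantization keeps one sign bit per dimension, and distance becomes a cheap bit comparison. A toy sketch of the idea, not Qdrant's actual implementation:

```python
# Illustration of why binary quantization shrinks memory ~32x: each float32
# dimension (32 bits) collapses to a single sign bit, and distance becomes
# a fast Hamming comparison between bit vectors.

def binarize(vector):
    """Keep only the sign of each dimension as one bit."""
    return [1 if x > 0 else 0 for x in vector]

def hamming(a, b):
    """Number of differing bits - a cheap proxy for vector distance."""
    return sum(x != y for x, y in zip(a, b))

dims = 1024
float32_bytes = dims * 4          # 4 bytes per dimension
binary_bytes = dims // 8          # 1 bit per dimension, packed into bytes
print(float32_bytes / binary_bytes)  # 32.0 - the memory reduction factor

q = binarize([0.3, -1.2, 0.8, -0.1])
d = binarize([0.5, -0.9, -0.4, -0.2])
print(hamming(q, d))  # 1 - the two vectors differ only in the third dimension
```

In a real engine the binarized vectors are used for a fast first pass, often combined with rescoring against the original vectors to recover precision.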
But if you just take whatever exists as a default, then you will just come to the conclusion that vector search doesn't work. And that's a homework that everyone has to do: choosing the embedding model wisely. And I'm not even talking about the evaluation, like building ground-truth datasets and running these whole evaluation processes. It's more about just reading the model card on Hugging Face or on the model provider's website and seeing whether your particular language is supported by the model you selected. I can strongly recommend using Cohere models here if you deal with multilingual data, because they claim to support over a 100 languages at the same time with really great quality. But there are also multilingual open source models available out there, for example, in Sentence Transformers.
And this is quite a typical challenge. And on that note, we all know that fine-tuning LLMs is a pretty expensive thing, and not all the companies out there can afford to fine-tune their models for the very specific problem they have. But contrary to that, fine-tuning the embedding models is relatively easy and cheap. And even if you can't find the ideal model that would support all the cases you need to support, fine-tuning one is relatively easy, and that might be done on very limited hardware in actually no time. It doesn't require you to have, like, a cluster of GPUs. Maybe a single GPU with a good base embedding model will be enough to just adjust the model so it works better in your specific domain with your specific terminology.
And, yeah, that's actually something that many mature teams started to do at some point, when they started to struggle with the pretrained models that they had used from the very beginning. So I really encourage you to have a look at fine-tuning the embedding models, because there are lots of interesting materials on that, and it's not as complicated as it may seem at first glance.
[00:42:29] Tobias Macey:
And as people are designing their systems and evaluating different approaches to context management, vector storage, and vector search, what are the cases where you would say that MCP and/or Qdrant are the wrong choice?
[00:42:49] Kacper Łukawski:
So definitely MCP shouldn't be used for data ingestion if you really deal with lots of data. That's something we've already mentioned. But it's great if you just want to connect multiple clients to the same instance of your vector database. So, let's say, your CTO can use Claude Desktop and still search over the data easily and use that to extend the context of the prompts. So this is great. And on the other hand, your developers can also connect to the same collection if they use these AI coding agents. So that's definitely a good choice if you really deal with lots of different clients that would like to connect to that same Qdrant collection.
But on the other hand, there are many cases in which vector search is not the best choice. I think we have discussed that example already, but if you have a running search system, and if you see that the majority of your queries come from users that speak the same language as you do, and I don't mean a foreign language, but they use the same terminology as the people who created the datasets, and they can also express themselves in a concise way. For example, they can provide a very specific product identifier they are interested in buying, because you are providing a tool for domain experts. In that case, vector search based on dense embeddings does not make much sense. It's maybe better to just use the traditional lexical search. And, yeah, this is actually an edge case, but I also remember one of our users who just started to use Qdrant as if it was a regular database.
So they were not putting in any vectors, but they were mostly using it as if it was a MongoDB database. You can technically do it. That's not the best choice, because obviously a search engine is not something you would like to use as your primary data store. But let me think about it. I feel like we should definitely evaluate every single case individually, because if you work with search and if you see that there are some cases you can't support with the existing means, then vector search is usually a good alternative. And Qdrant is a really efficient vector search engine, so maybe we would be able to help improve the quality of the search results easily.
[00:45:18] Tobias Macey:
And as you continue to build and iterate on the Qdrant technology, as well as the MCP server for it, what are some of the things you have planned for the near to medium term, or any particular use cases that you're excited to explore?
[00:45:34] Kacper Łukawski:
Sure. So first of all, we started the MCP server as a template, just to show people how to build their own MCP servers that would connect to Qdrant. And to our surprise, people started using it as if it were just a regular tool. So we decided to go this dual way. So first of all, our MCP server is available as a regular Python library, so you can build your own MCP server that will connect to Qdrant, and you can add some additional functionality if you prefer to. And on the other hand, you can just use it as it is, as a gateway to your existing Qdrant instance. And one angle that we are currently exploring is related to code generation, this grounded vibe coding that I mentioned, or grounded coding with the use of AI agents. We actually want to create some separate MCP servers that will handle code search specifically.
So they will use embedding models that were trained on code, so they should handle code search way better than the traditional general-purpose embedding models that we used in this base MCP server. So that's definitely something that we are exploring. And, yeah, we are open to discussion. Definitely, we don't want to expose an MCP server that is just overloaded with different tools, so that, for example, you can manage your Qdrant instance or scale it up from the chatbot interface. We believe that's not the way to go, even though there are some other MCP servers that try to expose these administrative tasks to the LLMs.
And one thing that I'm currently having a look at is the support for different parts of the protocol in different tools. We haven't mentioned it yet, but MCP is actually not only simple tool calling. An MCP server can also have different resources or prompts, for example. So they might be used not only for tool calling, but maybe as a source of truth for the prompts, you know, that work in very specific cases. And I believe that might be pretty useful for code generation. However, the adoption is not so wide here. Like, many of the existing MCP clients focus solely on the tools. So they treat the Model Context Protocol as if it were just a fancy way of performing tool calling. Hence, I just believe that they should start to incorporate some additional parts of the protocol, like the possibility for the MCP server to call the LLM back, or just to ask some clarifying questions, because that's something that we definitely need in order to be able to support really various scenarios. So I believe that's the direction we'd like to go in the future, and code search is just the beginning of it.
[00:48:32] Tobias Macey:
Are there any other aspects of the work that you're doing on the MCP server for Qdrant, or Qdrant itself, or the overall space of vector engines that we didn't discuss yet that you'd like to cover before we close out the show?
[00:48:47] Kacper Łukawski:
I think there are different aspects of the Model Context Protocol, because the Model Context Protocol is not only about the tools, but also about resources, prompts, and the ability to query the LLM back. But I think we covered that already. So that would be it, probably.
[00:49:02] Tobias Macey:
Okay. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:49:18] Kacper Łukawski:
I'll be speaking about the applications of LLMs in terms of data management, because I feel like that's something I feel confident about. I feel that many companies can't just easily take the SaaS tools and bring their data into the LLMs so they can perform these processes. And, definitely, we still lack some hybrid cloud capabilities, so we can run products with the ease of cloud on our own premises. That's actually something that we've done at Qdrant. We have a hybrid cloud offering, so you can just bring your own Kubernetes cluster and run Qdrant using the UI that you would get with managed cloud. And on the other hand, we don't have access to your infrastructure at all. It's only, like, one-way communication, so we can scale it up, but we won't be able to see any of your data. And I feel that enabled the adoption of vector search for many of our customers, because they couldn't just use the managed cloud easily, and they didn't want to host the open source version on their premises, because they wanted to have the support that comes with the cloud offering. And hybrid cloud was actually a game changer.
I feel like not that many providers have this kind of capability, especially when we speak about large language models or embedding models, at least some of them. It's not that easy to bring them into any corporation for that reason. If we want to build data pipelines, then definitely that's something that we still miss.
[00:50:52] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Qdrant and on the MCP server for it. It's definitely a very interesting project, and it's definitely one that I'm seeing get a lot of adoption. I'm actually using it for some of my own use cases. So I appreciate all the time and energy that you and the rest of the team are putting into that, and I hope you enjoy the rest of your day. Thank you. Thanks for the invitation. That was a real pleasure. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and colleagues.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. Your host is Tobias Macey, and today I'm interviewing Kacper Łukawski about how MCP servers can be paired with vector databases to streamline processing of unstructured data. So, Kacper, can you start by introducing yourself?
[00:00:59] Kacper Łukawski:
Of course. Hello. My name is Kacper Łukawski, and I'm a senior developer advocate at Qdrant. We are building a vector database that supports many applications which are related to large language models, but not only. And retrieval-augmented generation is probably the most typical use case nowadays.
[00:01:18] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:21] Kacper Łukawski:
Yeah. Of course. I have a software engineering background. And in one of my previous jobs, I used to work as a software developer, and we started building big data pipelines at that point. That was probably around 2014 or '15. And we built a couple of projects in the automotive industry, which were using Spark and all the Apache tools that were popular back then, like Kafka, Oozie, and many, many more. And that actually was, like, a natural transition for me to start building these kinds of solutions, including not only data ingestion, but also the data visualization and business intelligence parts.
[00:02:03] Tobias Macey:
And so now digging into the current frenzy around the applications of data, how to make use of it, how to use it to power these various AI applications, obviously, large language models have had a drastic impact on the utility and applications of unstructured datasets, which have largely been stuffed off to the side and used for bespoke purposes or used as training corpus for natural language processing tasks. But with the capabilities and the scale that large language models offer, we can now turn those into usable assets for various applications, whether that's business analytics, but more generally for language model applications.
And I'm wondering if you can just start by talking through some of the challenges that you're seeing teams face in building the pipelines that are necessary to be able to take that corpus of unstructured data and turn it into usable data assets.
[00:03:06] Kacper Łukawski:
Yes. Of course, there's been a massive impact on how we build data pipelines for unstructured data if we use LLMs. And I feel like one of the things that we still forget is that language models are not going to magically solve all the problems with our data, and they do not have any capabilities to fix it in any way. And, from my experience, there are many teams struggling with bringing in this data because they still don't understand the nature of language models, which might be making errors. It's not like a typical application where we write code and we can test it thoroughly. With LLMs, things are a little bit different, because we can figure out a way of how to process data using these models and then face some issues, because this is not going to work in all the cases that we have. And quite a typical enterprise case is that people have lots of scanned documents or PDFs, and they want to bring them somehow into their applications.
And there are various ways of how to do that. Like, the selection of a proper large language model is key here, or a visual language model, because we want to interpret images here. But still, there are challenges related to scalability and to the deployment of these models, especially if we work in an industry that can't just use proprietary SaaS-based tools. Then those teams start to struggle with setting all the pieces up.
[00:04:34] Tobias Macey:
And in terms of the destination of those unstructured assets into some usable data assets, what is the typical shape that you're seeing teams use as that destination point for unstructured sources, whether that's turning them into tabular data, extracting numerical data, or potentially turning them into graph representations using something like named entity recognition? And I'm wondering what are some of the common applications that you're seeing teams use those LLMs for in terms of that transformation.
[00:05:17] Kacper Łukawski:
Yes. So definitely graph RAG is becoming popular. For example, we have just finished a case study with one of our users, and they built a pretty interesting system that was using LLMs to derive ontologies from some unstructured data. That was applied to some restricted domains like law and medicine. And they were actually building a pretty interesting system that was able to understand the relationships in the data, and they used a dual modeling approach. So not only do they have vector embeddings used for capturing the semantics of the data, but also graphs to capture the relationships between different entries. And I feel like this is kind of a typical scenario nowadays. Like, everyone is speaking about graph RAG. But when it comes to all the other destinations, yeah, actually, LLMs simplified a lot of the problems that we were dealing with in the past, like named entity recognition, text classification, and translations as well. So there are various applications. Obviously, we had some other methods in the past, like algorithms trained solely for a specific problem. Right now, LLMs became the de facto standard for solving all of these problems at the same time. And yet, in my experience, vector databases are typically the destination for all the data they process, because our users typically want to build some sort of search system that will power their agents, or maybe just the search bar on the website. But this is actually the most interesting case for our users: how to finally start deriving some insights from the data that was not searchable in the past. So, yeah, I would say this is the most important application.
[00:07:01] Tobias Macey:
And then in terms of the modeling that is involved in the vector storage of these systems, obviously, vector databases as a category have seen massive growth in terms of their adoption and attention over the past two years because of the introduction of LLMs and the use cases around embeddings and RAG. But there are also vector extensions to other styles of database engines, one of the more popular ones being pgvector. And so then there's the consideration about what are the additional metadata fields or contextual elements that you want to collocate with these embeddings. And so, obviously, Qdrant is more of a document store, and pgvector is an extension to Postgres, which sits alongside relational data. And I'm wondering how you're seeing teams think about the design elements of what the broader context and utility is that they want to get out of the storage medium, beyond just the ability to have some means of storing these n-dimensional arrays.
[00:08:09] Kacper Łukawski:
Yeah. So first of all, I would distinguish databases from search engines, because that might be kind of confusing. I wouldn't call Qdrant a document store. It's more like a search engine. So if we were looking for an analogy here, it's more like Elasticsearch, which we used for keyword or lexical-based search in the past. And this is just an alternative to that paradigm that can also capture the semantics through the vector embeddings. And, yeah, quite a typical way of storing these vectors is to also keep the input data that was used to create them somehow. And the idea is that if we speak about text, then we usually just put this particular chunk that was used to create a vector inside the metadata, because in Qdrant, every single point can have even multiple vectors and some sort of JSON-like metadata.
And this metadata can actually contain the original data that can be used to reproduce the process of creating the embedding. But in many cases, we just keep a reference to the original data, which is stored somewhere else. That might be your relational database, or sometimes, like, a URL to a file, because vector search is actually not only about text; it can also make search over images, video, or audio data possible, which was impossible in the past. So, yeah, just to sum up: the typical approach is to keep the original data close to your vectors, because obviously you will need it, as these vectors alone do not let you recover the original information. And the typical approach is to use the vector for search, and then take the raw data and pass it somewhere else, like to the LLM if you build RAG, or maybe just expose the original documents to the user if you build plain search.
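The storage pattern described here can be sketched as a plain data structure: each stored point carries its vector plus a JSON-like payload holding the source chunk or a reference to it. The field names below are illustrative, not a fixed Qdrant schema:

```python
# A sketch of the point layout described above: the vector used for search,
# plus a payload that keeps the original chunk and a reference to its source.
# Field names and values here are illustrative examples only.
point = {
    "id": 42,
    "vector": [0.12, -0.45, 0.87, 0.03],  # in practice, hundreds of dimensions
    "payload": {
        "text": "The chunk of text this vector was created from.",
        "source_url": "s3://docs-bucket/report.pdf",  # reference to the original
        "page": 7,
    },
}

# After vector search returns this point, the raw payload text is what gets
# passed to the LLM (for RAG) or shown to the user (for plain search).
context_for_llm = point["payload"]["text"]
print(context_for_llm)
```

Keeping the chunk inline trades storage for convenience; keeping only `source_url` trades an extra lookup for a smaller collection, which is the choice being described.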
[00:09:58] Tobias Macey:
And then once you do have your data in this embedded format, you have the vectors, you have some reference to the original data, then there's the challenge of figuring out what the retrieval method is that you want to use, where vector indexes play a substantial role. But to your point, with Qdrant being a search engine as well, I know it supports a BM25 index versus the Lucene index that Elasticsearch relies on. And so for people who already have some investment in search infrastructure, as they're starting to do that analysis of how they want to store their vector embeddings, whether they use the capabilities in their Elasticsearch or OpenSearch clusters versus migrating at least the vector pieces to Qdrant, or migrating all of their search infrastructure to Qdrant, how do they manage that evaluation, and what are the key decision factors that go into it?
[00:10:52] Kacper Łukawski:
Yeah. Obviously, that's a common question. There are many companies that invested a lot of resources and time into these kinds of search engines, and they eventually added vector search capabilities as well. But the thing is that vector databases as a category were built solely to support dense embedding vector search, and they are just way more efficient, because the data structures we use for dense vector search are totally different from the ones we use for lexical search based on, for example, BM25. So there are different ways you can incorporate vector search into your existing retrieval pipeline. For example, you could obviously use the existing system and just add these capabilities.
And there are some differences in how we implemented vector search compared to these traditional systems, so to say. I remember in the past, they were just not scaling that well because they had a single segment. That was totally fine unless you were already dealing with millions of embeddings, because then you couldn't just store it on a single machine, and the cost of running that would be enormous. All the vector databases were built with efficiency in mind. So we implemented that as the first functionality, and BM25, or sparse vectors, are just an addition to that; vector search is still the key component. And if you have a running search system and you would like to experiment with vector search, it's relatively easy to build a service that connects to both systems. So you will have the existing system doing the lexical search with all the heuristics involved, and then you will have an efficient vector search engine supporting the semantics, because retrieval is actually a very hard problem.
And sometimes you would handle your queries in a completely different way depending on their length or their semantics. LLMs are also quite often used for that. So LLMs might be used to classify the input to your search engine, and then they might also be used to decide how we want to handle it. Traditional lexical search is just way better if we deal with lots of proper names or product identifiers where the exact match matters. On the other hand, if you have long-tail queries with lots of words, then it's better to default to vector search, because that should handle this type of query way better.
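The routing idea can be sketched as a toy dispatcher. The heuristics, the identifier pattern, and the three-word threshold are purely illustrative stand-ins for a real classifier:

```python
# Short, identifier-heavy queries go to lexical search; long natural-language
# queries go to vector search.
import re

def looks_like_identifier(query: str) -> bool:
    # Product codes / SKUs: tokens mixing letters and digits, e.g. "TX-4090B"
    return any(re.fullmatch(r"[A-Za-z]+[-_]?\d+\w*", tok) for tok in query.split())

def route(query: str) -> str:
    if looks_like_identifier(query) or len(query.split()) <= 3:
        return "lexical"   # exact match matters
    return "vector"        # long-tail natural-language query

print(route("TX-4090B"))                        # → lexical
print(route("what gift suits a six year old"))  # → vector
```

A production version of this router might itself be an LLM call that classifies the query, as described above.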
So you will be building a service around your search stack. You won't be calling the search engine directly; there will usually be some preprocessing done. And in that case, you can just add another component to make sure you have the best system to support a specific scenario. You also mentioned pgvector, the extension to Postgres, which is also quite popular when it comes to vector search. I know it's popular, and many people are happy using it, but the thing is that the database is not the best place to put an additional workload related to search. If you have just thousands of documents, that might be fine, because you won't even notice the difference. However, vector indexes are pretty resource intensive. They require lots of memory. And if you add this additional load to an existing relational database, then at some point you will see the system start to struggle with performance, because the majority of the resources will be consumed by the search.
Relational databases are designed to collect and store the transactional data, the data which is key for the business: your customers, your products, the transactions made between customers and products. Search engines, on the other hand, support the system; they are not the most critical part of the application. That's why we use different systems to support search, and the same applies here. If you want to add vector search capabilities, it's better to have a separate system that does that effectively, efficiently, and without generating additional load on your existing components.
[00:15:24] Tobias Macey:
And then another interesting aspect of these vector engines for search capabilities is that they can be used for more than just the RAG and context retrieval use case that LLMs are making more popular. Their original formulation was actually more of a semantic search capability. And so I'm wondering how you're seeing teams leverage Qdrant and vector engines more broadly beyond just the hyped up use cases around RAG and agentic capabilities?
[00:15:59] Kacper Łukawski:
Sure. Yeah. You're absolutely right. When I joined Qdrant in 2022, LLMs were not that broadly adopted yet. ChatGPT was introduced a few months later, and that changed the game completely. Right now, I would say 80% of the use cases that we see are somehow related to RAG or agents. But our early adopters started implementing semantic search as an alternative to keyword-based search, just because they saw it could handle scenarios that wouldn't be solvable with traditional means.
And for example, ecommerce adopted semantic search even before the introduction of the language models. Imagine you were running an ecommerce business and you had lots of products with titles and descriptions, but they were all in English. You also wanted to serve people who couldn't speak that language, and semantic search with multilingual embeddings became a pretty easy way to enable them to still use the system even though they do not speak the same language. On the other hand, there are also lots of people who can't really express their intents using just keywords, and you don't want to ignore a huge part of your audience just because they can't put together the right keywords to find what they need. In ecommerce, that's a key aspect. You want your users to be able to find what they want to buy. Semantic search enabled a different approach to search, and that was broadly adopted. But right now, we also see lots of people use vector search for different scenarios.
Maybe let me just briefly speak about the basics of vector search. In vector search, we have an embedding model, and this model is trained on some sort of data modality, like text, images, or videos, for example. Text is just the most typical problem that we solve. So if we have this model, it can take virtually any text and convert it into a fixed-dimensional vector, which is just a list of floats, some numbers. And these vectors have a useful property: if two different vectors describe a similar object or sample, they should be close to each other in that vector space, and we measure the similarity using, for example, cosine similarity.
So if the similarity is high, we assume that those observations are somehow related to each other. And since we have a similarity measure, like cosine similarity, we can also try to use that measure to detect samples which are out of the distribution of the data we want to support. Vector search is nothing new. This is actually the good old k-nearest neighbors algorithm, just approximated. And this algorithm is pretty versatile. It might be used for anomaly detection, for example, in the medical domain. Also, if you just want to experiment and see what the clusters in your data are, you could perform a classification just by a simple voting procedure, because you can select the most similar examples, which should be labeled already.
So it solves classification and regression at the same time. This is actually a pretty good method to solve multiple different problems that were typically solved with separate means. And obviously, that all might be combined with large language models too. If you have a system that accepts natural language queries from users and you see a new query sent to the application, and this query is pretty far away from all the past observations you had, that should also trigger a human in the loop, because maybe somebody is trying to perform a prompt injection attack and use your LLM credits, maybe to create some structured output and attack the system in that way. If you already calculate the embeddings, you can also try to perform this anomaly detection using the same vectors. So this is a really versatile method, and we see many people use it not only for RAG or agentic purposes, but also to support these traditional machine learning problems, now just with vectors.
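The properties described above can be illustrated with a toy example: cosine similarity between embedding vectors, nearest-neighbour lookup, and flagging an out-of-distribution query when its best similarity is too low. The tiny hand-written vectors and the 0.5 threshold are stand-ins for real model outputs and a tuned cutoff:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" for three past observations
corpus = {
    "a cat sat on the mat": [0.9, 0.1, 0.0],
    "kittens love sleeping": [0.7, 0.3, 0.1],
    "quarterly revenue grew": [0.0, 0.1, 0.9],
}

def nearest(query_vec):
    # Exact k-NN with k=1; vector databases approximate this at scale
    return max(corpus.items(), key=lambda kv: cosine(query_vec, kv[1]))

text, _ = nearest([0.85, 0.2, 0.05])  # query close to the "cat" cluster
print(text)

# Out-of-distribution check: a query far from everything seen so far
best_sim = max(cosine([0.0, 0.9, 0.0], v) for v in corpus.values())
print("review" if best_sim < 0.5 else "ok")
```

The same similarity score drives retrieval, voting-based classification, and the anomaly flag, which is why the one set of vectors serves all of those uses.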
[00:20:36] Tobias Macey:
Now in terms of the work to be done as far as integrating the vector engine into a broader system context and managing the data loading and transformation for creating the embeddings, maintaining them, and managing the retrieval. Obviously, there's a lot of work that has gone into that, and I know that one of your areas of focus is building and maintaining the Model Context Protocol server for Qdrant. I'm wondering if you can talk to some of the ways that the introduction of MCP as a protocol has simplified the work to be done, or how it is integrated into that overall flow of information: managing your unstructured sources, loading them into Qdrant, and then also managing it for retrieval. And I know it also simplifies the work of using Qdrant as a store for contextual memory.
[00:21:31] Kacper Łukawski:
Yeah. Sure. So, surprisingly, I wouldn't recommend using the MCP server to ingest the data into the system. There are many other tools that might be used for that, with Airflow probably being my preferred one, just because MCP servers are supposed to act as a sort of plugin, so you can connect different tools to your LLMs. And it doesn't make much sense; you have more possibilities if you just use the SDK directly. Vector databases have a lot of optimizations that might be used to make things cheaper, faster, or more accurate, and if you just use the MCP server, you are losing that ability, or you still need to use both. There are plenty of other tools you can use out of the box, like Airflow or some cloud tools if you prefer. However, MCP has enabled people to use the data that you ingest with your pipelines in various applications.
So there might be a very heavy ingestion pipeline that takes lots of unstructured data and sends it to Qdrant, and on the other hand, there will be lots of people connecting to the same Qdrant instance and using the ingested data for their own purposes. Qdrant here acts as a knowledge base, and this knowledge base may contain different things. It might be some sort of data specific to your business, or it might be a set of code snippets coming from different projects you created within your organization, depending on how you would like to use it. And MCP servers, at least our MCP server, are not designed to support the ingestion part. Of course, there is such a tool. You can use it if you, let's say, connect your MCP server with Claude Desktop and you would like to build a personal knowledge base that stores all your memories.
But I strongly encourage you to use a different tool for the ingestion pipeline. This is just not the best way to integrate with all these tools, and that applies not only to our server, but also to many others. We also modified our existing MCP server in a way that allows you to run it in read-only mode, so the LLM won't even be able to store anything inside your Qdrant collection. I feel like that should be the preferred way of using it, except for some very niche use cases.
[00:24:04] Tobias Macey:
So in terms of the role of the MCP server in the overall system architecture and the ways that data engineers should be thinking about context management within Qdrant, what are some of the best practices that you've found around how to structure the data in a way that is conducive to the retrieval elements? Obviously, chunking is more of an art than a science currently. Embedding models are generally going to give you different results based on the particular domain that you're in. And I'm wondering how you're seeing teams think about that end-to-end flow and some of the ways that the MCP server maybe reduces the barrier to entry on the consumer side to help with some of the evaluation work to be done?
[00:24:54] Kacper Łukawski:
Yes. That's a great question. So definitely, chunking is one of the problems we need to solve once we decide to implement vector search. And there are no easy answers here. Obviously, the default settings in all the popular frameworks are not the best ones. You can try the simplest means possible, like setting a window size and dividing your documents into chunks of that size, with some overlap between them so you do not lose the context. But in many cases, context is not only about a particular piece of text; it might be derived from the document's structure. Say we work with documents like PDFs: they have formatting, like headings, and adding those headings to your chunks usually helps to convey the overall context of that particular piece. On the other hand, if you work with code and you would like to create a system that searches over your code base, then a particular loop with some function calls does not say that much about the role of that piece of code in the whole application.
So in that case, it's better to keep the name of the class this particular method belongs to, the name of the method itself, and the parameters that are passed to it, along with the actual body of the method. Including the documentation strings can also help with that. I really like the piece from Anthropic, actually, an article, not a paper, about contextual chunking. They presented a pretty interesting idea of how to use an LLM to summarize the role of a particular piece of the document in the context of the whole document.
LLMs are actually pretty good for that, especially if we deal with traditional PDF-like documents, because they can easily summarize them. And then, if we have a summary of the whole document, we can also clearly state the role of each particular chunk that we create. But, yeah, there are no easy answers, as I said, and it really depends on the data you have. In many cases, chunking is just an experiment that you need to conduct, and evaluation is key here. But the good news is that evaluation in information retrieval is nothing new, and I assume you do test your search pipelines if you think about search seriously.
Then you can use the same means to evaluate the quality of retrieval in the case of vector search, because you use the same metrics and the same tools; nothing is new except the paradigm that you use to search.
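The simplest chunking approach mentioned above, a fixed window with overlap plus the section heading prepended to each chunk, can be sketched like this. The window size, overlap, and heading are illustrative:

```python
# Fixed-window chunker with overlap; each chunk carries its section heading
# so the embedding still sees the surrounding context.
def chunk(text: str, heading: str, size: int = 40, overlap: int = 10):
    chunks, start = [], 0
    while start < len(text):
        piece = text[start:start + size]
        chunks.append(f"{heading}\n{piece}")
        if start + size >= len(text):
            break
        start += size - overlap  # slide the window, keeping some overlap
    return chunks

parts = chunk(
    "Vector search indexes embeddings for fast nearest-neighbour lookup.",
    "Chapter 2: Retrieval",
)
print(len(parts), parts[0])
```

Real pipelines would split on sentence or token boundaries and tune the sizes per corpus; the point is only that every chunk keeps a pointer to its context.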
[00:27:40] Tobias Macey:
One of the interesting aspects of the current era that we find ourselves in is that software and data engineering for the past few decades have been very deterministic, based on standard practices that are provable using testing or theorem provers, with standardized architectures. It seems that the introduction of large language models has generalized the need for the adoption of, and expertise in, experimentation that data scientists have been dealing with for a long time now. And I'm wondering how you're seeing data teams in particular come to grips with that aspect of the work to be done, where they need to be more in the loop of experimentation: tracking the results of those experiments, being able to iterate quickly on versioning their different chunking strategies, deploying different embedding models, and managing the re-embeddings. We're not at a point anymore where we can say, okay, this piece of the overall workflow is done, because the workflow is constantly changing as new capabilities and new models come out. I'm just wondering how you're seeing that dynamic play out in data teams in particular.
[00:28:55] Kacper Łukawski:
Yeah. Evaluation is key. Definitely, data teams should be focusing on evaluating multiple pieces here. For sure, choosing the embedding model is an important component of that. I'm not a big fan of just testing out all the embedding models that exist, carefully watching for newly released models and trying to use them. There are different ways to improve the quality of search than just taking the best model according to the benchmarks and re-ingesting all the documents you have. Let's be honest: if you have millions of documents, then creating the vectors for all of them will take you a lot of time. Even if you can scale up your environment, that would be expensive if you just take the biggest model that exists. And evaluation is key. Sometimes we can sacrifice some precision for the sake of effectiveness and efficiency.
So what I usually try to convince our users of is that they should take the smallest model that does the job at an acceptable level. I'm a big fan of using small models. They are easy to fine-tune, and there are plenty of ways you can increase the quality of search after creating the embeddings. But let me summarize the aspects that we need to evaluate when we build retrieval-augmented generation or anything related to LLMs that has this vector search component. Basically, choose the embedding model based on some internal benchmarks, not the public benchmarks that we can easily find, like MTEB. MTEB is the most popular benchmark presenting the retrieval quality of different embedding models, but the thing is that some companies can easily train their models so they shine in these benchmarks, and none of these benchmarks will have the data that is specific to your own business. In that case, doing the evaluation on your own is key. On the other hand, if we use LLMs in the whole pipeline, then we also need to evaluate the quality of the outputs of the LLM, so there are two models to evaluate separately.
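An internal retrieval benchmark of the kind described above boils down to standard metrics over a hand-labeled ground-truth set. A minimal sketch of recall@k, with made-up queries, result rankings, and relevant ids:

```python
# recall@k: what fraction of the human-labeled relevant ids appear in the
# top k results returned by the search system.
def recall_at_k(ranked_ids, relevant_ids, k):
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# query -> (ids returned by the system, ids a human marked relevant)
benchmark = {
    "waterproof jacket": ([3, 7, 1, 9], [3, 1]),
    "running shoes":     ([5, 2, 8, 4], [8, 4]),
}

scores = [recall_at_k(r, rel, k=3) for r, rel in benchmark.values()]
avg = sum(scores) / len(scores)
print(avg)
```

The same harness works unchanged whether the rankings come from lexical search, dense vectors, or a hybrid, which is the point made above: only the retrieval paradigm changes, not the evaluation.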
If we are confident in our retrieval phase, meaning we are sure the embedding model does its job, then we also need to evaluate the quality of the outputs of the LLM. If it's struggling to answer the question even though the context provided is fine, then choosing a different one might be important. And this is a challenge, actually. Evaluating the retrieval is easy compared to evaluating the LLMs. There are various metrics, but there is no consensus yet on how to do it properly. We typically take a better LLM to evaluate the worse or smaller one that we want to use, or if we can't use a SaaS tool, then we use it to evaluate the quality of an on-premise LLM.
So this is also a challenge. But, yeah, many experiments have to be done if you really want to build a system that will be able to work with your data, and definitely, it's not gonna be deterministic. I mean, LLMs might be working fine in 99% of the cases, but you also need to make sure that you are able to trace back all the errors that may occur and monitor the use of the LLMs in some way. Observability here is a really important component that many teams forget about.
[00:32:33] Tobias Macey:
In terms of the evolution of these systems, particularly when you're dealing with the embeddings as a corpus of context for a RAG model or an agentic use case: as new embedding models come out and as you change chunking strategies, you can potentially balloon the overall storage size of the vectors that you're storing. And I'm wondering how you're seeing teams manage the life cycle of those embeddings, understanding which ones are being used, when they can age out older embeddings, and just some of that overall management of the evolution of these systems without necessarily just letting costs grow unbounded.
[00:33:17] Kacper Łukawski:
Yeah. That's definitely important, but it's not something that you get for free. It's not built into any vector database that I know; it's basically something that you need to monitor on your own. But the thing is, if you have an application already running in production, then you rarely experiment with new embeddings in that production environment. There is typically a process of evaluating them offline on just a fraction of your data, on some ground-truth dataset that you are able to build. And then, once you have proven that the new embedding model works well in that scenario, you can swap it in. But, yeah, the important thing is that the cost of running vector search might be high, because it is so memory intensive. That might be mitigated by different means, various ways to optimize the storage so it doesn't cost you a fortune to run semantic search. But in general, you rarely experiment with dozens of different embedding models in production, because that is just too expensive to do. If you have a production system whose quality you would like to improve, you can experiment with some other techniques.
For example, instead of just retrieving the context using a single vector embedding, you can also try a re-ranker: take some more candidates in the initial retrieval phase and then re-rank them so the better results pop up at the top. That's a pretty easy approach that doesn't require you to change the structure of your collections in the vector database. There are various means that you can use to make it better. But these experiments are not done in a living system; it's more like research that has to be done beforehand.
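The retrieve-then-rerank pattern can be sketched with two stand-in scoring functions: a cheap first-stage score playing the role of the vector similarity, and a costlier second-stage score playing the role of a cross-encoder re-ranker. Both scorers and the documents are illustrative:

```python
def vector_score(query, doc):
    # Cheap first-stage score (stand-in for dense-vector similarity)
    return len(set(query.split()) & set(doc.split()))

def rerank_score(query, doc):
    # Pricier second-stage score (stand-in for a cross-encoder)
    return sum(doc.split().count(w) for w in query.split()) / (len(doc.split()) + 1)

docs = [
    "qdrant stores vectors and payloads",
    "vectors vectors everywhere in qdrant qdrant",
    "postgres stores relational rows",
]
query = "qdrant vectors"

# Stage 1: over-fetch candidates with the fast score
candidates = sorted(docs, key=lambda d: vector_score(query, d), reverse=True)[:2]
# Stage 2: reorder only the short list with the expensive score
best = max(candidates, key=lambda d: rerank_score(query, d))
print(best)
```

The collection schema never changes: only the query path gains a second, more expensive pass over a handful of candidates.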
And changing the embedding model is actually an important decision, and not that many teams do it that often.
[00:35:26] Tobias Macey:
As you have been building and iterating on the Model Context Protocol server for Qdrant, working with the Qdrant community, and coming to grips with the constant evolution in the space, what are some of the most interesting or innovative or unexpected ways that you're seeing the combination of MCP and Qdrant applied?
[00:35:47] Kacper Łukawski:
Well, I wouldn't say it's that surprising, but when people got excited about vibe coding, everybody started to build applications over the weekend, and MCP servers might be pretty useful to perform something that I like to call grounded vibe coding. I mean, if you work in a company that has lots of different projects and you want to keep up the standards, then you definitely don't want to vibe code an application and let the LLM do whatever it wants. You want to have a knowledge base that understands all the projects that you have. It may also contain code snippets so you can reuse them, or maybe you have a very specific frontend framework that you use to keep all the applications you create consistent.
And actually, that became a pretty standard way of using our MCP server. People use tools such as Cursor, Windsurf, Claude Code, or even VS Code nowadays, and combine them with Qdrant, which acts as a knowledge base for this grounded vibe coding. So they put all the code snippets, for example, there. And then, if they ask their agent, the coding assistant, to create something brand new, it will start by trying to find some similar code snippets from different projects. Maybe it's not necessary to create it from scratch; maybe we can just reuse some of the components. Or maybe it can point out that there might be an internal library within the company that you can use for that specific purpose.
So you don't reinvent the wheel. That's not that surprising, but it's a pretty interesting area that we feel will keep growing, because that's what the majority of users are trying to do now. And surprisingly, I didn't really anticipate this huge success of the MCP. I thought it would only be adopted by Anthropic, even though we released this MCP server just a few days after it was announced. And then we were astonished by how many people started to use it for different things, and by the fact that OpenAI and Google also started to incorporate this protocol into their products. That was actually the moment we realized this is more important than we thought.
[00:38:08] Tobias Macey:
And in your work of helping to build and guide and interact with the community around Qdrant and experimenting with these various vector use cases, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:38:26] Kacper Łukawski:
I think many people still struggle with the cost of running vector search. And, surprisingly, not that many know that there are plenty of options for optimizing it. Methods such as quantization can help to greatly reduce the cost of running vector search. One surprising thing is that if you use a relatively high-dimensional model, let's say over a thousand dimensions, which is not that rare nowadays, then your model will typically be compatible with binary quantization. So you can reduce the memory footprint by up to 32 times and make it faster by up to 40 times. Many people who started building vector search were dealing with these problems, and when they enabled binary quantization on their collections, on their vectors, things started to behave way better. So this is a common challenge that I see in the community. There's also not that much understanding of what vector search really is. People think that vector search will magically solve their problems and enable search over any data that they have. Unfortunately, it's not that easy. You really need to have a model that supports your data, and you can't just take the simplest one that's available out there. You need to choose it wisely. Since I'm based in Poland and I've spoken to so many companies here, I realized there are many people trying to use the defaults that come with LangChain or any other framework that exists in that space. For example, they take OpenAI embeddings, because they were the default ones for a very long time, and try to use them for non-English data. In some cases that works; I think it works pretty well with German, but it's just not documented anywhere. There are various options. There are model providers that officially support multiple languages at the same time.
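The 32x figure above follows directly from the arithmetic of binary quantization: each float32 dimension (4 bytes) collapses to a single bit, and similarity is then approximated with Hamming distance over the packed bits. A back-of-the-envelope illustration, with toy four-dimensional vectors:

```python
def binarize(vec):
    # Keep only the sign of each dimension, packed into an int bitmask
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    # Number of dimensions whose sign differs between the two vectors
    return bin(a ^ b).count("1")

dim = 1024
float_bytes = dim * 4      # float32 storage: 4 bytes per dimension
binary_bytes = dim // 8    # binary storage: 1 bit per dimension
ratio = float_bytes // binary_bytes
print(ratio)  # → 32

v1 = [0.3, -0.2, 0.8, -0.1]
v2 = [0.1, -0.4, 0.5, 0.2]  # signs differ only in the last dimension
print(hamming(binarize(v1), binarize(v2)))  # → 1
```

High-dimensional models tolerate this well because with a thousand-plus dimensions, the sign pattern alone still separates neighbours from non-neighbours; a real deployment would enable this via the database's quantization config rather than by hand.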
But if you just take whatever exists as a default, then you will come to the conclusion that vector search doesn't work. And that's a homework that everyone has to do: choosing the embedding model wisely. I'm not even talking about evaluation, like building ground-truth datasets and running the whole evaluation process. It's more about just reading the model card on Hugging Face or on the model provider's website and seeing whether your particular language is supported by the model you selected. I can strongly recommend using Cohere models here if you deal with multilingual data, because they claim to support over a 100 languages at the same time with really great quality. But there are also multilingual open source models available out there, for example, in sentence transformers.
And this is quite typically a challenge. On that note, we all know that fine-tuning LLMs is a pretty expensive thing, and not all companies can afford to fine-tune their models for the very specific problem they have. But in contrast, fine-tuning embedding models is relatively easy and cheap. Even if you can't find the ideal model that supports all the cases you need, fine-tuning one is relatively easy and can be done on very limited hardware in virtually no time. It doesn't require a cluster of GPUs. A single GPU with a good base embedding model will be enough to adjust the model so it works better in your specific domain with your specific terminology.
And, yeah, that's actually something that many mature teams started to do at some point, when they began to struggle with the pretrained models they had used from the very beginning. So I really encourage you to have a look at fine-tuning the embedding models, because there are lots of interesting materials on that, and it's not as complicated as it may seem at first glance.
[00:42:29] Tobias Macey:
And as people are designing their systems and evaluating different approaches to context management, vector storage, and vector search, what are the cases where you would say that MCP and/or Qdrant are the wrong choice?
[00:42:49] Kacper Łukawski:
So definitely, MCP shouldn't be used for data ingestion if you really deal with lots of data. That's something we've already mentioned. But it's great if you just want to connect multiple clients to the same instance of your vector database. So, let's say, your CTO can use Claude Desktop and still search over the data easily and use that to extend the context of the prompts. This is great. And on the other hand, your developers can also connect to the same collection if they use these AI coding agents. So that's definitely a good choice if you really deal with lots of different clients that would like to connect to the same Qdrant collection.
But on the other hand, there are many cases in which vector search is not the best choice. I think we have discussed that example already, but if you have a running search system and you see that the majority of your queries come from users who speak the same language as you do, and I don't mean a foreign language, but they use the same terminology as the people who created the dataset, and they can express themselves in a concise way, for example, they can provide the very specific product identifier they are interested in buying because you are providing a tool for domain experts, then in that case, vector search based on dense embeddings does not make much sense. It's maybe better to just use traditional lexical search. And, yeah, this is an edge case, but I also remember one of our users who started to use Qdrant as if it were a regular database.
So they were not putting any vectors in, but they were mostly using it as if it were a MongoDB database. You can technically do it, but that's not the best choice, because obviously a search engine is not something you would like to use as your primary data store. But let me think about it. I feel like we should evaluate every single case individually, because if you work with search and you see that there are some cases you can't support with the existing means, then vector search is usually a good alternative. And Qdrant is a really efficient vector search engine, so maybe we would be able to help improve the quality of the search results easily.
[00:45:18] Tobias Macey:
And as you continue to build and iterate on the Qdrant technology as well as the MCP server for it, what are some of the things you have planned for the near to medium term, or any particular use cases that you're excited to explore?
[00:45:34] Kacper Łukawski:
Sure. So first of all, we started the MCP server as a template, just to show people how to build their own MCP servers that would connect to Qdrant. And to our surprise, people started using them as if they were just regular tools. So we decided to go a dual way. First of all, our MCP server is available as a regular Python library, so you can build your own MCP server that will connect to Qdrant, and you can add some additional functionality if you prefer to. And on the other hand, you can just use it as it is, as a gateway to your existing Qdrant instance. And one angle that we are currently exploring is related to code generation, this grounded vibe coding that I mentioned, coding with the use of AI agents. We actually want to create some separate MCP servers that will handle code search specifically.
So they will use embedding models that were trained on code, so they should handle code search way better than the traditional general purpose embedding models that we used in this base MCP server. That's definitely something we are exploring. And, yeah, we are open to discussion. We definitely don't want to expose an MCP server that is just overloaded with different tools. So, for example, managing your Qdrant instance or scaling it up from a chatbot interface: we believe that's not the way to go, even though there are some other MCP servers that try to expose these administrative tasks to the LLMs.
And one thing that I'm currently having a look at is the support for different parts of the protocol in different tools. We haven't mentioned that yet, but actually MCP is not only simple tool calling. An MCP server can also expose different resources or prompts, for example. So they might be used not only for tool calling, but maybe as a source of truth for the prompts that work in very specific cases. And I believe that might be pretty useful for code generation. However, the adoption is not so wide here. Many of the existing MCP clients focus solely on the tools, so they treat the Model Context Protocol as if it was just a fancy way of performing tool calling. Hence, I believe they should start to incorporate some additional parts of the protocol, like the possibility for the MCP server to call the LLM back or just to ask some clarifying questions, because that's something that we definitely need in order to support really various scenarios. So I believe that's the direction we'd like to go in the future, and code search is just the beginning of it.
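To make the distinction concrete, the protocol separates these capabilities at the message level. The method names below come from the MCP specification; the tool name, resource URI, and prompt arguments are illustrative placeholders only:

```python
import json

# JSON-RPC 2.0 request envelopes for three distinct MCP capabilities.
def mcp_request(req_id: int, method: str, params: dict) -> dict:
    return {"jsonrpc": "2.0", "id": req_id, "method": method, "params": params}

# 1. Tool calling: the part most clients support today.
call_tool = mcp_request(1, "tools/call", {
    "name": "qdrant-find",
    "arguments": {"query": "how do we paginate results?"},
})

# 2. Resources: server-provided data the client can read into context.
read_resource = mcp_request(2, "resources/read", {
    "uri": "docs://style-guide",
})

# 3. Prompts: server-curated templates, a "source of truth" for
#    prompts in very specific cases such as code generation.
get_prompt = mcp_request(3, "prompts/get", {
    "name": "code-review",
    "arguments": {"language": "python"},
})

print(json.dumps(call_tool))
```

A client that only implements the first message shape is treating MCP as fancy tool calling; the other two are where the prompt and context-sharing value lives.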
[00:48:32] Tobias Macey:
Are there any other aspects of the work that you're doing on the MCP server for Qdrant, or Qdrant itself, or the overall space of vector engines that we didn't discuss yet that you'd like to cover before we close out the show?
[00:48:47] Kacper Łukawski:
I think there are different aspects of the Model Context Protocol, because the Model Context Protocol is not only about the tools, but also about resources, prompts, and the ability to query the LLM. But I think we covered that already. So that would be it, probably.
[00:49:02] Tobias Macey:
Okay. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:49:18] Kacper Łukawski:
I'll be speaking about the applications of LLMs in terms of data management, because I feel like that's something I feel confident about. I feel that many companies can't just easily take the SaaS tools and bring their data into the LLMs so they can perform these processes. And, definitely, we still lack some hybrid cloud capabilities, so we can run products with the ease of the cloud on our own premises. That's actually something that we've done at Qdrant. We have a hybrid cloud offering, so you can just bring your own Kubernetes cluster and run Qdrant using the UI that you would get with the managed cloud. And on the other hand, we don't have access to your infrastructure at all. It's only one way communication, so we can scale it up, but we won't be able to see any of your data. And I feel that enabled the adoption of vector search in many of our customers, because they couldn't just use the managed cloud easily, and they didn't want to host the open source version on their premises because they wanted to have the support that comes with the cloud offering. And hybrid cloud was actually a game changer.
I feel like not that many providers have this kind of capability, especially when we speak about large language models or embedding models, at least some of them. It's not that easy to bring them into a corporation for that reason. If we want to build data pipelines, then definitely that's something that we still miss.
[00:50:52] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Qdrant and on the MCP server for it. It's definitely a very interesting project, and one that I'm seeing get a lot of adoption. I'm actually using it for some of my own use cases. So I appreciate all the time and energy that you and the rest of the team are putting into that, and I hope you enjoy the rest of your day. Thank you. Thanks for the invitation. That was a real pleasure. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and colleagues.
Introduction to Vector Databases
Challenges in Processing Unstructured Data
Applications of Large Language Models
Vector Databases vs. Search Engines
Evaluating Search Infrastructure
Beyond RAG: Semantic Search Applications
Integrating Vector Engines into Systems
Experimentation and Evaluation in Data Teams
Managing Embeddings and System Evolution
Optimizing Vector Search Costs
Future Directions for Qdrant and MCP