High Performance And Low Overhead Graphs With KuzuDB

Hello, and welcome to the Data Engineering podcast, the show about modern data management.

Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy?

DataFold's migration agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure a perfect parity between your old and new systems.

Whether you're moving from Oracle to Snowflake, migrating stored procedures to DBT, or handling complex multisystem migrations, they deliver production ready code with a guaranteed time line and fixed

price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold

to book a demo and see how they turn months long migration nightmares into week long success stories. Your host is Tobias Macy, and today, I'm interviewing Prashanth Rao about Kuzu DB, an embeddable graph database. So, Prashanth, could you start by introducing yourself?

Yeah. Hi, Tobias. Hi, everyone. I'm Prashanth. I'm an AI engineer at Kuzu, a modern embedded graph database.

I also lead developer relations at Kuzu. And I actively interface with Kuzu's user community to showcase how to use Kuzu in a variety of AI and data engineering tasks.

And do you remember how you first got started working in data?

Absolutely. Yeah. So my background is actually quite varied. So I'd like to start a bit, from from the beginning. I have two master's degrees, one in aerospace engineering from the University of Michigan and another one in computer science from Simon Fraser University in Canada. In between these two degrees, I worked in applied physics, where I was running numerical simulations for, computational fluid dynamics tasks. And here I built a lot of computational work experience and the ability to formulate problem statements

and clearly work through an analytical solution. In my computer science master's degree, I specialized in NLP and machine learning, where I was learning about classical machine learning techniques in NLP.

But just then, the BERT model came about. This was in 2018,

and transformers were just taking off in the, AI world. So in the middle of my program, I landed an internship as an AI engineer at the Royal Bank of Canada, and my first exposure to graphs and graph databases was at this job.

So I was using Neo four g at the time to build a proof of concept graph application

where we were trying to do client prospecting. And I was using combination of techniques from NLP, information extraction, and other methods

to gather data from various sources to build knowledge graphs. And I found this entire process really fascinating.

So in between, I worked in a couple of other companies once I finished my program, in health care and manufacturing,

both with graphs and other database systems. And I've done a ton of work in regular data engineering, like most people do, I guess. But it's also been interspersed with a lot of AI and machine learning tasks. And I found Kuzu where I currently work when I met our CEO, Simi, at a local meetup in Toronto in Canada. And after that, I was just hooked. I I continued engaging with the Kuzu team in open source. And a few months later, I realized that our visions were aligned. So I joined Kuzu later that year in 2024

to spread the word and get more people excited about it, just like I am.

And so digging into Kuzu itself, can you give a bit more of a context of what the actual project is and some of the story, at least as far as you understand it, behind how it got started and, I guess, a little bit more about why you, in particular, want to spend your time and energy focused on it?

Definitely. So, yeah, Kuzu is, as I mentioned, a modern embedded graph database. It's designed to make it very easy for developers to work with graphs. Kuzu began more than eight years ago as an academic research project, actually, at the University of Waterloo, where our CEO, professor Seni Salihoglu,

with one of his students,

published a survey on the state of graph databases in the market. And the aim was basically to understand what gaps existed in the graph database market. And from the findings of that study,

it was clear that there was a lot of performance shortcomings in existing solutions,

especially for analytical large scale graph query workloads. So in 2019 and '20, when DuckDV,

which many people may be familiar with, is also an embedded database but in the relational world,

DuckDV began kicking off around then. And there were enough evidence in the space that, the fundamental principles of these analytical or OLAP style systems worked well in practice. Zakuzo builds upon years of experience from this, the academic research that went into it, and from the lessons learned from other analytical systems,

like TopDB, MonetDB, Umbra.

And it brings a lot of their principles, like columnar storage and a lot of other compression techniques and things that are well known into the graph database space. Zakuzu was open sourced in 2022

as an open source project, MIT licensed. And the company was founded in early twenty twenty three. I can also, briefly talk about some of the key features of Kuzu, in a way that differentiates it from other graph databases that you may be familiar with. So Kuzu's architecture, as I mentioned, it's very similar to DuckDV and SQLite. It's an embedded database, meaning that it runs in process. And the database is just a file on disk. It uses columnar disk based storage, and it combines this with a lot of other innovations like vectorization,

factorization,

and a lot of novel join algorithms. And because everything is built on top of disk based operators,

Kuzu scales to really large graphs. So we've tested Kuzu on billions of nodes and edges, graphs on a single machine. And the other main thing that differentiates Kuzu is it's heavily focused on the usability aspect. It does use the property graph data model and the Cypher query language, which were both popularized by Neo4j many years ago. But the combination of performance, scalability,

and the ease of use, plus the fact that it's open source, at least in my view, makes it a compelling choice for a wide range of graph applications.

And you mentioned

the

embeddable aspect of it, which makes it very, as you said, easy to get started with, which is very appealing for a number of use cases.

But there are also

a number of situations where you would want to have a more server oriented

graph engine, which is where Neo four j has become sort of the de facto standard in the ecosystem.

There's a lot more activity currently happening in the graph space as a result of practices such as graph rag and its applicability

to LLM use cases because of the

Yep. So, yeah, there's a lot to unpack there. The first thing, we need to

Yep. So, yeah, there's a lot to unpack there. The first thing, we need to point out is that, yeah, open source Kuzu is embedded and runs in process. But we are actively building an enterprise version that has a server on top of it to exactly deal with some of the enterprise features that people have come to expect from systems. And part of it is that, yes, certain use cases do benefit from having a server on top of the graph databases,

especially when you have a lot of concurrent reads and writes happening, which is very common in many use cases. So, yeah, that being said, open source Kuzu is without a server. It runs as a file on disk in your machine. But, there is an enterprise version that kind of addresses some of these, you could say limitations and scales it up to enterprise use cases. And to answer your question about the different use cases that Kuzu is used for, would you like to go into some of the common scenarios in which we see how Kuzu is used?

We could dig into that in a little bit. But I guess just more broadly,

I'm interested in your perspective as somebody who's building a graph engine,

what the broader ecosystem of graph database adoption looks like. Because for a long time, it has been very nascent where there are specific use cases where people will bring in graph engines. But even in contexts where a graph might be the superior representation,

people are still leaning on existing

relational or or even NoSQL engines because of the fact that they already have those

technologies in their application stack, and they don't want to put in the investment to bring in a new type of engine and just some of the ways that you're seeing that potentially shift as LLMs drive more adoption of graph use cases?

Yeah. That's a great question. Yeah. I'd like to highlight that part about, your point where people tend to think of graphs as a very specialized use case or niche use cases.

Historically,

graph databases have been applied in highly specialized domains,

social networks, fraud detection, recommendation systems, and so on. But as you mentioned, the arrival of LLMs has, I think, reinvigorated

the interest in not just graph databases, but in how graphs are used in these applications.

So we are seeing more and more applications where combining the benefits of vector retrieval and graph retrieval

can actually provide better context to these LLMs in graph rack systems and beyond. Context engineering is a very hot buzzword right now. But honestly speaking, graphs are just a great addition to an AI engineer's toolkit these days because inherently, a graph imposes structure on otherwise

unstructured data. And as we know, real world data is a mix of both structured and unstructured data. In fact, it's arguable that most enterprise organizations

have more unstructured data than they do have structured data. So with the capabilities of these modern LLMs and the awesome tooling around these LLMs, it's becoming more and more feasible to transform unstructured data into knowledge graphs. And I I actually have a feeling where we're going to see a lot of, like, a proliferation

of knowledge graphs in the industry in the coming years because we are very easily able to get good structured outputs from these LLMs and transform them into something that resembles a graph. And graph databases are going to be at the core of a lot of these retrieval systems. So the hope is that Kuzu is actually that retrieval engine that people use for these kinds of applications.

And, yeah, we can definitely go into some specific applications of this.

Another element

of the overall positioning of Kuzu that I think is worth unpacking

is the term scalable because that has a lot of potential dimensions along which you can argue for scalability

where it could be scalability in terms of capacity to execute at speed given a particular set of hardware. It can mean horizontal scalability where you can allow for multi node storage and retrieval.

It can allow for scalability in terms of just being able to query across a large number of nodes given certain other constraints. And I'm wondering if you can just give a bit more nuance to the context in which Kuzu is laying claim to that scalability

attribute.

Definitely worth clarifying. So scalability in Kuzu refers to the scenario where you initially build a graph of any size. It could be small, medium, or large, depending on how you define that. And, typically, in the earlier stages of a project, when you are still iterating on the graph data model, you don't want to put your full data set into the graph database. But once that stage is more or less done and the data model is formalized, what we mean by scalable is that from that very same underlying workflow that you had in the initial stages, you can seamlessly scale up the data ingestion as well as query processing to the entire dataset that you have, even if the actual data is of the orders of hundreds of gigabytes or a couple of terabytes of data. And Kuzu's query processor is actually designed to handle data of this size. Because as far as possible, any operations are pushed down to disk. And for expensive workloads, intermediate results are written to temporary files on disk. And they're all assimilated to produce a final output. So we currently implement this approach for large relationship insertions into Kuzu of the order of tens of billions of edges. But we're also expanding this to the query processor, where you have really expensive recursive queries that are touching very, very large portions of billions of nodes. So in a nutshell, scalability, as defined in Kuzu, is scalability on a single machine, where you as a user don't need to worry about performance

impacts as the data gets larger and larger. Because, as you know, disk storage is incredibly cheap. Now comparing this with the traditional notion of scalability, which people tend to use the definition as horizontal scalability,

where you distribute the workloads across several machines, there are definitely other graph databases that do this. But the main challenge with these systems is that they are really expensive to maintain and run over long periods of time. So for these reasons, Kuzu's focus has always been maximizing single machine or single node performance and efficiency. And it does this really well. We have plenty of real world users actively pushing Kuzu workloads to really large graphs, tens of millions or hundreds of millions of nodes and billions of edges, and finding a lot of value in Kuzu's performance in these workloads.

The other typical reason why, at least in an operational context, you would want to have horizontal scalability

is for reliability guarantees or uptime guarantees where if a single machine fails, you have other machines that are able to take over, and then you can re replicate the data on another machine that comes into rotation.

I'm sure that that's probably part of your overall goals for the enterprise system, but I'm just wondering, given the fact that you do have the right ahead log as part of the default operation of Kuzu, what are some of the potentials for being able to consume that right ahead log and use that as a replication mechanism for maintaining a hot standby or something like that for cases where you do want to have high uptime guarantees

with Kuzu as the actual graph engine.

Yeah. Great point. So the as you mentioned, write ahead logs are the core of, Kuzu's, internal operation. Basically, every transaction in Kuzu is ACID compliant.

So it it functions as an ACID compliant database just like any other relational system does. But you're right that the open source version may not include

the reliability and backup guarantees that people may come to expect, but these are all the active features that are being built and implemented for the enterprise edition, which leverages all these underlying features that you mentioned.

So digging now a bit more into potential use cases for Kuzu,

because of the fact that it's embeddable, it allows

for

situations such as edge compute where you want to be able to distribute a single graph for read only use cases to multiple locations. But I'm wondering if you can just talk through some of the types of architectural

design and integration patterns that you're seeing Kuzu applied within?

There we go. Yeah. Yeah. Yeah. So okay. I'm all set now. Yeah. So the way Kuzu is used depends on who's using it and for what application. Data engineers often integrate Kuzu into their ETL workflows

by plugging it into their orchestration framework of choice. Data scientists tend to use just a simple pip install Kuzu to get going right away and interface with their favorite data frame libraries or file formats

where they use Kuzu to analyze their data

via methods like graph algorithms and visualizations.

But another way people use Kuzu is purely in memory. And this can be very beneficial for ephemeral workloads,

where you don't need the persistence guarantees of a database. But but you just want to quickly bring in data from another source.

It could be a file on disk or an API call. And you bring this data into a graph structure, run a graph computation on it, something like a graph algorithm or a recursive join that's really expensive in SQL.

But then you take the results back to another system, which could be a database or a file on disk.

So when operating in memory, Courzu is very fast because it deals with purely in memory data structures.

But the difference in performance is marginal because Courzu on disk is already so fast. And then the other common pattern that we are seeing a lot more of recently is, AI and GraphRack related tasks. The paradigms are rapidly evolving as to how people bring together these different retrieval layers as part of their system.

But the idea is that you want to combine the best of vector and graph retrieval. And, the fact that Kuzu provides a performant graph as well as a vector index,

these are very useful ways for people to build interesting retrieval mechanisms into their LLM based applications. So depending on, again, who's the user and what their use case is, Kuzu applies in a lot of these different scenarios.

And so digging a bit more into the application patterns of Kuzu, you mentioned some

situations such as in the data pipeline where you need to do some sort of relationship extraction or conversion of data into

a graph structure, and maybe you want to just select out a subset of those nodes to then persist in a different storage system or a different format or turn into some sort of maybe feature engineered,

element for an ML pipeline.

The other use case, as you said, is maybe you want to have a persisted

graph that you can then retrieve

for different use cases.

And I'm curious,

particularly for maybe some of the situations where people are pushing the edges of Kuzu,

how you're seeing people deal with some of the more esoteric architectures

where maybe they have

a persisted graph in object storage, and then they have a cached read of that that gets loaded into an application for being able to do I'm gonna use the term real time, but more interactive interactions

for end user applications.

Right. Yeah. So the esoteric means, as you mentioned, I think there's a lot of interesting ways, thanks to the design of Kuzu, that people are using it. I come to your question about the usage that you just mentioned. But the first thing I would like to point out here is because Kuzu is an embedded, that is in process database,

it can run pretty much anywhere. And Kuzu is compiled separately

as a binary for every system architecture that we support, all the way from x eighty six sixty four to ARM 64

to, as you said, embedded in mobile devices that could run Android or iOS. And we've seen people run Kuzu on mobile devices

in in scenarios that we wouldn't expect. Pretty much any device that can run Android, which includes, like, so many different edge devices these days, can benefit from running Kuzu

locally within that system. And one thing is that as LLMs get smaller and cheaper, they're also being pushed to the edge. So it's very interesting to see how people are combining these small, low cost LLMs

and using Kuzu's disk based vector and graph index,

on those systems. It's opening up a whole new host of interesting use cases.

Another very interesting esoteric use case, that we talked about publicly and, actually, it's it's been it's been presented as well is how Kuzu is used in a data platform called Bowplan. I'm not sure if you've used this. But essentially, Bowplan is a modern serverless data platform.

It helps developers build data pipelines declaratively in Python. But what's really interesting about this application is the usage of how Boblan applies graphs under the hood. And in fact, Kuzu is at play in their system. Boblan was the one of the earliest adopters of Kuzu.

They exclusively run Kuzu in memory,

to plan their data workflows for things like dependency management,

or validating their pipeline configurations.

And all of this validation and planning happens in the order of hundreds of milliseconds,

where they are running hundreds of Cypher queries on our temporary in memory Kuzu graph. And once they generate their plan and all the operations are complete, the graph is thrown away. So based on the success story, actually, we are seeing a lot more people ask about

ephemeral graph use cases, and we'd love to see more people build innovative applications in these scenarios where graphs add value.

One of the other interesting aspects of Kuzu is the set of extensions and integrations

that are available.

And

I know

that one of the options is using the h t t p f s so that you can store and retrieve

Kuzu databases

from object storage or other external locations.

It also, similarly to DuckDV,

has the capabilities of integrating with other table formats. So Iceberg, in particular, is one that has gained a lot of attention in recent years. And I'm curious if you can talk to some of

the use cases as well as challenges

of integrating with some of those external data stores such as Iceberg or data lakehouse architectures.

And I I also know that it has support for integrating with Postgres relational stores and just some of the trade offs of being able to

do consumption and transformation

of those either columnar or some of those table oriented storage systems,

converting those into graphs and the patterns that people are using as far as do they then persist those graphs for incremental

updates and use within other contexts, or is it typically something where it is more of an ephemeral use case of I need to be able to do a graph representation of this subset of data, and I do that routinely, but I don't actually want to persist that because of the fact that managing the upgrade you know, updates or incremental changes

are too complex or too expensive, etcetera.

So, yeah, we can talk about the first aspect, which is,

the way Kuzu interfaces with these external storage systems. And the first point to clarify there is that on these external systems, for example, Iceberg or any data lake, Kuzu can directly scan these systems.

But it allows developers to bring the data from these scans into Kuzu. And that could be either in memory or on disk, depending on the nature of the application. But Kuzu maintains the appropriate,

hash indices and CSR based join indices, which basically are the node and edge indices that, power the Cypher query engine on top. And now the data has to first come into Kuzu's native format before these query workloads can be run. So in in some ways, even regardless of whether Kuzu operates in memory or on disk, there is still some degree of ETL happening.

So to maintain data freshness,

you could leverage tools like DAXTER, Prefect, or Airflow, or any of these modern tools to orchestrate the data movement based on the change data logs of what data has evolved in the upstream system. Now there there's some more ideas on this that the core team at Kuzu has been mulling over. And this is related to what's called zero copy operations,

where we're considering

implementing features

where you could essentially run zero copy Cypher on top of relational systems like Postgres or Iceberg. And in this approach, you don't move the data over into Kuzu. You scan the table in the current state, as we currently do. But Kuzu would only manage and create indices on top of those tables. It wouldn't actually copy the data into the Kuzu storage layer. And using this approach, the query processor would operate on the data that is still in the primary source. The data never leaves the primary source. So we did actually have an active branch in our repo where we tested some of these ideas in Postgres. And they definitely show a lot of promise. So it's something that's on our radar. And depending on the interest from the community, this is something we would revisit. And I'm sure there are other teams thinking about this as well. But, yeah, we are actively looking into

making Kuzu function in a much more seamless way with the existing data formats out there.

I think in particular that that's a

an area that is

underserved

in the overall graph market. The only engine that I've come across that offers any sort of similar capability is, I believe, PuppyGraph, which is a paid only product as far as being able to do graph representations

over data that exists in lake house formats.

And I'm wondering if you can talk a bit more to some of the complexities of managing that zero copy index and the translation

of graph into the underlying

tabular representations

and maybe some of the ways that there are opportunities

for extension of things such as the iceberg or Delta formats to have more graph native

primitives available within their table catalog or or table metadata to be able to help with some of that translation effort or help with reducing some of the scan and processing necessary on the actual underlying files.

That's that's really interesting. So I'm sure, you your audience as well as you have followed the updates in the Duck Lake,

format that came about recently

and the kind of upheaval it's causing because it's moving away from the JSON based representation

of table schemas

to, more, like, database based or, like, schema strong approach in that in that format. So

based on all of this, I think

it's important to appreciate and understand

what exactly is happening under the hood when when you bring data into Kuzu and any graph database. So, the the key point here is that as data is ingested in this graph native format,

there are a lot of additional indices being created on the fly. For example, Kuzu uses a hash index where it creates primary keys. You

point to a primary key in the data. Every data set that you bring in needs a unique primary key identifier.

And Kuzu updates on the fly the hash index that points to the primary key column. So all of these are really, really important things to keep track of in the underlying index that is managed by the database. So you're right that the current ecosystem does not tend to have a lot of graph native, or graph friendly features in this space.

But the way we are approaching this, at least currently in integrating with these data lake formats and other systems, is we are actively using Kuzu's extension ecosystem

to directly scan the underlying contents of the tables in these systems.

And and WTB has laid a lot of the foundation in this space by building these excellent connectors. So we leverage our a lot of ideas from what they are doing. But in the larger scheme of things,

we are interested to see how this ecosystem evolves in terms of, are people actually

using graph in the zero copy way? Right? In some cases, it just might make sense to persist the data to a graph database because the data becomes very large.

Processing it in memory can have its own potential pitfalls.

So not all use cases lend themselves really well to this ephemeral, put everything in memory and in zero copy format.

Of course, there are certain use cases that do benefit. But I feel like this is a two way

link between the rest of the ecosystem,

in terms of all these different data formats, the evolution of Iceberg, v two to v three, the all of these upcoming changes that are happening here. I think that will all play back into how we ask Akuzu approach this.

Another interesting

parallel

is the work that has gone into, for instance, the LAANC table format where they are bringing native vector indices

to lakehouse

architectures and the underlying parquet files. And I'm curious what you see as the opportunity to

do something similar in the graph space where maybe you have parquet files as the underlying

store, but you have some more native graph primitives

as the

table metadata format for being able to

cut down on some of the retrieval and processing

necessary for being able to actually

consume subsets of those graphs.

Yeah. That's really an awesome observation. Yeah. I actually follow, LansdDB and the Lance format very closely. It's a very interesting format. I actually I'm a user of LansdDB as well. So it's it's really great to see that kind of innovation happening. But the inherent difference between LAANC as a format and Parquet is that LAANC specializes in random access. And Parquet is specialized at block retrieval of, like, large amounts of, rows or columns at a time. So they're both columnar systems. But, Parquet essentially lends itself really well to graph style ingestion as well as query workloads. And the good thing about Parquet, because of the maturity of the ecosystem and all the innovation going on around,

the stuff the the tooling build on top of it, So we're seeing a lot of, emphasis on the Arrow connectors.

As you know, Parquet is basically the disk persistent format of Arrow, which is the in memory spec, that decides how data is stored in memory for columnar storage layers. So Kuzu actively leverages this Arrow ecosystem. In fact, if you open the Kuzu GitHub and check out our issue list, you'll find a lot of new issues related to using this Arrow format much more efficiently.

And part of this is also to do with what we're doing for our enterprise version, but then it ties back into the open source version. The underlying benefit of relying on the arrow representation

is that because Kuzu is a strongly typed system, unlike many of the graph databases out there, the data that is brought into Kuzu has a strict schema where types are known beforehand.

Now, initially, when users come over from other systems to Kuzu, they're surprised by this. But then it becomes very obvious why this is a benefit for multiple reasons, mainly because if you know the types of things beforehand, you can make a lot of good assumptions about how to process that very, very efficiently at scale and add compression techniques and a bunch of other things. Combine this with the maturity of the Arrow ecosystem and and the parquet format and everything around that means that you're able to leverage the type underlying type system of Arrow. So one great example of this is the way we integrate with the Polars ecosystem. So Polars is a data frame library, and that's also very actively integrated with the arrow format under the hood. So every Polars data frame essentially can be transformed into an underlying arrow table. That arrow table stores the the data types and the exact data format in memory. Kuzu directly can scan the contents of that arrow table. So essentially, a polymers data frame can be natively scanned in Kuzu by accessing the underlying arrow types. Now this is very powerful because

the larger the data ingestion operation becomes, you have millions of rows, maybe hundreds of millions of rows that are coming

in. Operating on batches of records at a time is very, very efficient. So so one very common pattern of using Kuzu, in fact, I use Kuzu a lot this way as well, is using or leveraging the underlying column of formats out there, parquet, iceberg, and everything, and bringing those into an in memory representation,

an arrow table in polars. And you can easily batch those records up. So you you if you have 100,000,000 records, you would batch those in, like, size batches of a million or so. But when you have a underlying polar data frame of that size

and Kuzo natively scans that, you can very rapidly ingest those million records in pretty much no time as opposed to relying on traditional methods where you use for loops in your programming language and, you know, iterate through each row one at a time. So so I think there's a lot of,

synergy between the way Kuzu approaches this whole heavy duty data ingestion, analytical compute on top of the underlying column restores and the data formats out there. Now how this ties into LAANC, I'd be very interested to see how this goes. There already have been some chats in our Discord and on GitHub about potentially

scanning the context of Lance tables directly.

And now we have our Kuzu has its own vector index, as does Lance, obviously, because they're a vector store. So it's very interesting to see if in the future we could leverage some of those formats as well. But but for now, Arrow and Parquet are first class citizens in Kuzu.

And given your investment in the Arrow ecosystem,

another subproject of that that has been gaining a lot of attention is the data fusion engine. And I'm curious what you see as some of the potential for being able to leverage some of that investment

as a translation mechanism for Kuzu as

well. Yeah. That's it's one of my favorite topics. Data Fusion is a library I've been following for a very long time. And, honestly, the it it's amazing what people are building on top of Data Fusion. In fact, LansDB, as I know, is also building on top of Data Fusion because they expose a SQL query engine that,

essentially leverages Data Fusion's operators.

One challenge with Data Fusion as it applies to Kuzu is that the Data Fusion project is essentially SQL first. It is all about SQL. It allows people to build their own custom query engines but leveraging traditional SQL operators.

Now under the hood, deep down, it's true that most graph database systems compile down to physical operators that resemble what SQL has. But as we go higher level up the stack, the features that Data Fusion exposes currently don't lend themselves very well, at least to my understanding,

for what a graph database requires, the kinds of optimizations and the kinds of access we need to the storage layer underneath.

So I'm very interested to see if there's anyone out there who is building some variant

of data fusion or extending data fusion in a way that supports

Cypher based operators or Cypher based expressions.

And maybe that could be something that could explode in open source in the future, the more popular graph databases become. The part of the challenge with graph databases is that unlike the relational ecosystem, there hasn't been that much of a consensus around query languages.

It's been a lot of internal debate among many other vendors out there and the formats being used.

The Cypher is a very popular query language for graph databases, and Kuzu adopted Cypher for this very reason because the user base is quite large. It's pretty popular. People already know how to work with it. And it has a lot of similarities

aesthetically to SQL. But the challenge is that there are other systems out there, other graph databases, that are building their own proprietary formats or their own languages, not actively contributing back to the Cypher ecosystem. So there's been a new standard called GQL,

that's come about. And it's still in the process of being formalized. But it looks like that seems to be the standard around which the GraphQL community is rallying. However, as we know with query languages, that can take a long time. So I am the more and more I think about this, I feel like we are a few years away from this. But, Data Fusion and all the projects that emerge from it, there could be a variant of it that can potentially cater to the future standard GQL, which is a strictly typed query standard or a language standard for graph query languages. So, yeah, I'd love to see where that goes. And maybe, yeah, who knows? We can play back into that ecosystem as well.

Yeah. You you beat me to the punch on that one. I was gonna ask about GQL.

And I I think it's also interesting

given some of the conversation

around interop with things like Parquet and the very column oriented

ecosystem around analytical use cases

and to your earlier point about Kuzu

at its storage layer being graph or sorry,

being column oriented.

And I'm curious if you can talk through a bit more of some of the internal architecture of Kuzu and, in particular,

the benefits and trade offs of that column representation

with the adjacency list indexes

versus

the approach that other graph engines have taken with more graph native primitives at the storage layer?

Definitely. So ultimately, every database, whether or not it's a graph database, it makes a decision between being a row oriented system versus a column oriented system. If you're optimizing for reads or scans or analytical queries,

then it's pretty much a given that a columnar design is better than a row oriented design. And you'll see pretty much all the analytical databases and OLAP systems out there following this column oriented design. The thing with graph databases, the kind of query workloads you see in graphs,

these normally contain really hard or expensive read queries. For example, finding parts between entities or patterns in the data instead of, like, right or transactional oriented workloads. So for these reasons, SkuZoo basically embraces the columnar design. And

it uses, essentially, a lot of innovations from that ecosystem

in the graph database space. Now going into columnar storage as such, the benefit of columnar storage is you're able to process blocks of records at a time than processing one tuple at a time. And this enables a lot of downstream

benefits, which is all queries in Kuzu will now execute in a vectorized fashion. It's able to access these underlying blocks of columnar stores. So it's very similar to how other OLAP systems like DuckDV and ClickHouse do it. The other thing is that queries can also execute in a parallelized manner. We use more server driven parallelism in Kuzu. So combining this with the other benefits of columnar storage, vectorization,

using multiple threads on your device becomes, very simple. The queries become much faster as a result.

And in addition to this, there are some other, novel techniques introduced in Kuzu.

One of these is factorization,

where you get significant performance benefits in, certain kinds of queries, where you have intermediate results, but these are compressed when you evaluate the many to many joints.

And the the result of this using factorization is that you get an exponential reduction in the size of the result being processed at runtime. So, we're seeing a lot of benefits in terms of

certain classes of queries running way faster than they would in traditional graph database systems. And then from a join algorithm perspective, as we know, one of the most expensive kinds of queries that you could run-in a graph database are these recursive joins

or many many hop joins where you are traversing

parts of length greater than two or three or four. And for this, the query planner in Kuzu seamlessly combines

two different join algorithms. One of them is the common binary join, and the other is the worst case optimal join. And these allow different kinds of queries to benefit from this sort of fusion between the two. For example, binary joins work really fast for acyclic style queries. But worst case optimal joins are known to be faster when there is a high cyclic component. So having all these heuristics in place and bringing together all these different innovations at the storage layer as well as a query processing layer is basically what makes Kuzu really, really fast. And as a side note, I also recommend reading some of the techniques, that I've just talked about in much more detail in the Kuzu blog. It's basically blog.kuzudp.com.

The name of the article is why graph databases need new join algorithms. And it was written by our CEO, Seni, who's a professor at University of Waterloo. So, yeah, these are techniques that have been around for quite a while, but bringing them together,

in one single product, into a usable industry product, I think that's where Kuzu really shines.

As you have been

working with Kuzu DB, working with the community around it, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?

One of the really interesting use cases that I mentioned to you is the way is using it in ephemeral workloads. So till recently,

the idea or the impression that people have of graph databases

has been that they are these expensive monolithic pieces of software that are essentially alternative to relational systems

who use them in different primary data stores scenarios. But the way we are seeing Kuzu use and the way we want to explain to others is that graph databases are actually not a replacement for relational systems. They are actually a very good complement to not just relational systems, but many other tools out there. They are essentially a different way of retrieval,

mechanism. And that that's basically what we are seeing right now in the different ways people are using graphs. The bulk line use case was an ephemeral load where you actually have no need to persist the data and you just want the query engine or the compute capabilities of the database. Having an embedded architecture really benefits this because now the database can move to where the data is. It doesn't actually have to be this system that is sitting distinctly in a server somewhere. So that's one very interesting application.

The other, very, very interesting use cases we're seeing is pushing graphs to where they have never gone before, things like edge devices or mobile devices. There's a lot of focus on the data privacy

of how data is stored on systems. But then when LLMs are involved, the fact that you're potentially sending private data to an LLM server somewhere remotely and bringing that back. So having, an embeddable architecture, having these unique, innovations for all the the storage and query processing brings these new capabilities to the devices that are in that sandbox. You essentially operate the database and all the tools around them in that sandbox.

So as these models improve, as hardware improves, as we see more performance coming out of these different systems, it's gonna be very interesting to see how people are pushing these. And we're already seeing some innovations around using graphs in very, very low resource devices. That being said, there have always been use cases for graphs where you do need to run very, very large expensive workloads on on billions of nodes. So so that is still very much, I think,

a lot of interesting use cases we are seeing in the community as well as people,

who have been pushing graphs to the limits. So, yeah, I think I think Kuzu applies in such a wide range of scenarios that it's gonna be interesting to see where it picks up in popularity.

And to that point of edge use cases and use cases where you wouldn't typically have a graph database available, I also noticed that you have a WASM

compiled target for Kuzu DB so you could load it into a website and have

a knowledge graph natively as part of the website delivery without having to have it have any server side component for managing the querying and traversal.

And I'm wondering too if you have any interesting,

either applications or ideas of applications

for that particular

compile target as well.

Yeah. Thanks a lot for bringing that up. That's a very important point. So the Wasm build of Kuzu, exactly as you said, it's designed to run in browsers, pretty much any browser. Now the thing there is that, yes, the graph database is embedded in the database in the browser itself, and there is no data ever leaving that existing browser session. So the moment you close the browser and the session, the data is lost. So, in fact, the Kuzu demo, if you go to the website kuzudb.com

and click on the demo link, that's essentially an embedded Wasm database running in that browser session that you in your tab of your browser. So we've actually published a blog on this in our

blog .kuzuribi.com

as well, where, we actually

ran

a local LLM, and it uses the the web GPU that runs in your browser. So, essentially, every browser has a GPU component to it. It can access the internal GPU on your machine. So

the LLM that we use in that scenario was actually able to run, obviously, not a very big model, but it was a very low cost LLM that could run-in the resources available in the browser. Part of the limitation there is that the browser imposes

strict memory constraints on that session. I believe the limit is in the range of four to eight GB or so of memory, which is not that much memory, to be honest. So inherently, the database as such is not the bottleneck.

The graph database runs very, very smoothly in the browser. In theory, you don't need to use LLMs with it. I could see use cases where, let's say, you're in a very, very sensitive environment in, like, a health care organization or a bank or a government organization,

and you have the open source library available to you. You would essentially deploy the Wasm

database in your internal session in that browser. Any data that it sees is guaranteed to be only within that session because the whole idea of the Wasm is that the data never leaves the browser. And, obviously, in those scenarios, the browser would obviously be not be connected to the Internet, and there's a whole set of other security

aspects in place. But the idea is exactly that. There are so many niche use cases where you may want to enclose the operation of the database within that environment. So just like you have edge devices where the data never leave leaves that device,

you also have browser based sessions in in Wasm where the data never leaves the browser. And all of these are, again, unique ways in which Kuzu can be used,

to unlock new use cases.

And in your experience of

contributing to Kuzu, helping to popularize

your overall experiments, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of working on the project?

So I've, spent about two years, just under two years at Kuzu right now, and I've worked in the graph database space, for, I think, almost six or seven years. And what I found the most unexpected in my time at Kuzu is how often graphs are either misunderstood or undervalued.

Now a lot of enterprises think of graphs as this niche or expensive

alternative solution to their existing tools. And I found that people tend to force fit their workloads that would benefit from a graph to some kind of other solution, like a relational tool using SQL. Now my experience shows me otherwise.

When I come across tools like Kuzu, it's pretty simple to make a decision as to when the graph is beneficial for for the use case and when it is not. And you can easily adapt your workflows in a very cost efficient way. Like, you don't need to replace your existing solutions with a graph database. You can complement them with a graph based approach. And this is just as true for things like RAG.

People tend to think of RAG, and they they they treat it synonymously with vector search.

I really I think that's a misnomer. Like, RAG has nothing inherently to do with vector search. Yes. Vector search does help in retrieving data from your underlying store using semantic similarity.

But in inherently, graphs can actually add a lot of value to RAG as well. So so I'm hoping the language around what a graph is and how do you use a graph productively

as part of your system, how do you deploy the graph database, these are all things that I think people tend to misunderstand and and put labels on it. And I think that's where, we are trying our best to communicate these use cases. And that's a lot of what my role is, telling people that, yes. You can use a graph for these scenarios and to complement this other thing that you already were using.

And for people who

are

looking to

leverage graph capabilities

for their particular workload, what are the situations where you would recommend against using Kuzu?

Okay. Yeah. So it's important to know this. First of all, the kind of workload matters.

Any scenario where the workload is inherently write heavy by write heavy, I don't mean doing a lot of writes. Essentially, I'm talking about traditional transactional use cases. For example, if you are an online e commerce platform

and you need to ingest millions and billions of, orders into the website. For those kinds of scenarios, relational systems have been tried and tested. They do their job really well. It really does not make sense to use a graph for those kinds of workloads because they don't benefit from the underlying storage mechanism and the relationships in the data. And, essentially, that ties into how people are using or what kind of queries they are writing. If the query workload is inherently

an aggregation kind of query workload, where you're trying to efficiently compute aggregations of groups grouping on billions of records. Again, relation systems do this really well. They've done it for for decades.

It it only makes sense to introduce a graph into your use case when you find that these other systems have performance limitations or the queries become really, really verbose.

These are where I think, the graph use case really, really shines. One additional thing I I've I've recently noted, because of all the work I've been doing with LLMs and AI, is that the,

idea of using

LLMs to write queries, either SQL or Cypher or any other language. LLMs obviously

are like, they're outputting tokens,

and query verbosity does make a difference.

To express many hop joins in SQL, it is a lot more verbose than the equivalent query in Cypher. So for me, it was very interesting to see the difference in the work that's been done on text to SQL. Text to SQL is basically using an element to write SQL for you versus text to cipher where you use an LLM to write cipher.

And I find that modern LLMs, like, really good LLMs, write really good cipher. Part of it is they have learned some cipher from the amount of training data that's out there. But also the other part of it is that Cypher is not as verbose as SQL for many hop joints.

So certain kinds of queries actually can be very easily written by these modern tools. In a in a like, packaging all of this in into a single statement, I think I really think that graphs are underutilized,

and,

there's definitely a lot of scope now to use graphs and, inherently graph queries in different use cases that were not talked about before.

Yeah. Just briefly enumerating some other applications of graphs, particularly in the data engineering

context that I've come across.

I think one of the more notable ones is in the context of things like entity resolution

for purposes

of master data management.

And I'm curious if you're seeing people using Kuzu in that context because of the fact that it's lightweight and just maybe some of the more

some of the more write once read many situations in that data processing and data

and data discovery context?

Yeah. That's a very, very good question. And in fact, I was going to bring it up if you did note as well. So this ties into my previous point about there's so much unstructured data out there. And we are now at a point in time where we have the tools like LLMs and all the additional,

safeguards around them to bring the data from that unstructured form into a structured form like a graph. So we are, I think, still in the early days of it where we don't yet have that explosion of knowledge graphs out there being created from these different AI based methods.

But the first challenge that people will face when they build these graphs is that you need some form of entity resolution

because LLMs have great contextual understanding of data. But

inherently, in every data set, there is going to be the same entity that is mentioned in different ways. And unless you explicitly ask the model to disambiguate them or do something specific to it, you're going to have graphs with duplicate values in them. So so, in fact, the Kuzu user community is actively engaging with us on this. I seen a lot of users actually not only work on these, but actually build useful tools with this. So as a database company, we actually actively leverage the user community alongside us to not only understand what use cases are happening out there, but also build on top of us. So I think that's the way I see this going. We have a lot of powerful AI tooling around that is happening is innovating at its own pace outside of the database space. But there's plenty of room for innovation for tool builders

to come up with a, let's say, a convenient developer friendly way for organizations to,

extract their insights from unstructured data. And I actually follow a lot of these companies, but, inherently, a lot of them are actually looking at graphs and using Kuzu for these kinds of things. So entity resolution is a very, very, you could say, unsolved or very challenging problem. And I'm very eager to see how we are going to actually apply techniques like this. Of course, it's not universal, and there's a lot of safeguards that need to go in. But I really feel that, yeah, graphs are really, really entering the mainstream,

and we need to really be aware of the implications of having all this data out there in the form of graphs.

And as you continue to build and iterate on Kuzu, we've already talked to some of the forward looking plans, but I'm wondering if you can just share any near to medium term efforts that you're particularly interested or excited in.

So the first and foremost, the most important one is that we are actively building an enterprise version of Kuzu. So first, I would recommend people go to kuzudb.com

to check out for some more information on this. The main purpose of this is obviously to address the feature set that people expect from an enterprise,

setting. Right? Like, all the things related to security, observability,

data backups, and so on. So the idea is that you wanna be able to allow people to self host Kuzu using the enterprise setup in their secure space and have all these features baked in. So so I'm really excited to see how people are actually bringing graphs and graph databases,

into their organization in a way that manages everything. Right? You have cost, you have performance, and you have scalability.

I think the combination of these three is is quite hard to achieve with many other solutions. So so that's the part I'm actually,

really looking forward to see how it evolves.

Also, I wanna say one of the driving ambitions of the Kuzu team is that we want to be the most widely deployed graph database in the market.

SQLite is the most widely deployed database in history.

DuckDV is hugely popular and loved by its user community. So we want to be on that level, but in the graph database space. So we have it's still, I I guess, early days for us compared to these other tools. But, yeah,

looking at the way we've been received and the way people are using graph databases in these all sorts of different applications,

it's quite exciting to see where this is going.

And just one last thought that came to mind

is in the context of things like

SQLite or DuckDV and relational engines in particular.

When you're using them in in an embedded context, it's usually with some host language.

Python is obviously a very popular one. There are a lot of

translation tools for Python objects to SQL in the form of various ORMs.

I'm curious

what you see in the ecosystem

as far as a similar

construct for graph queries for being able to programmatically

generate the appropriate queries, but with a more idiomatic

host representation

in that language context?

Yeah. That's a fair question. So I follow the relational, like, ORM ecosystem pretty closely. I've worked with a lot of tools myself. So I've seen a few implementations of OGMs, object graph mappers.

And now the thing is, they apply to certain scenarios

where you have, you could say,

rule wise or record based operations in a transactional sort of use case. Like, if you use any of the other transactional systems,

ORMs came out of that ecosystem where you are inserting or reading one record at a time. But inherently, an analytical workload does not lend itself well to those use cases. So the more I actually began experimenting or playing with these OGMs and ORMs, I began realizing that when you actually use an analytical database, like, let's say, ClickHouse or DuckTD or any of the other systems out there, the way you write as well as read data is drastically different. And everything in those analytical systems is tuned towards these columnar, like, batch based operations. You do you don't want to handle single records as far as possible, especially during,

writes.

So the the same principles actually apply to Kuzu as well because it's built on those analytical design principles.

And what I'm not saying that Kuzu should not support those kinds of paradigms, but currently, the the usage pattern that we see is that,

we have this parameterized query

representation

in every language that we support.

So, essentially,

just like your one SQL injection,

by having parameterized queries in the query like, the interface on top, the the underlying,

client ecosystem in Kuzu, for all the languages we support, I believe there's seven languages now that we support, each of them has a system where you pass in a parameterized

value so that you don't have, you know, pure strings being passed to your system. So that being said, it is not the same as ORM or OGM where you actually have object mappings. So I'd like to see how this, evolves.

Currently, I see WTB's

ecosystem, and the primary way people interface with them in SQL is this they tend to just write parameterized SQL queries and use these powerful integrations in the ecosystem, like data frames, the same way we do in Fuzu, actually. Like, the recommended way I always say tell people to bring data into Kuzu and query Kuzu and even output data from Kuzu is, at least in Python, leverage the data frame ecosystem.

Data frames are incredibly powerful data structures. They're optimized for not just columnar operations, but very, very

rapid,

efficient in memory compute. So during data ingestion, you have so many benefits by leveraging this data frame columnar structure that you already have at your disposal.

So rather than inserting records or reading them one at a time, it might it makes sense to just offload that responsibility to these powerful tools. So, yeah, I know that it doesn't directly answer your question, but I I I'm pretty sure we're gonna see some, you know, movement on this. OGMs are definitely a thing. I've seen them around. I'd like to see maybe how other people are using them.

And just quickly too on that point of integrating with the data frame ecosystem. I know that Python also has a very rich ecosystem of graph libraries.

Network x is one of the more widely known ones, but there are also a lot of them that are more focused on novel graph algorithms or more performant implementations of graph algorithms. And I'm curious what you see

as some of the translation layer between those

graph focused Python libraries and

persistence and retrieval of something like Kuzu and the the translation to and from Cypher.

Yep. So Kuzu

directly off the bat supports network x graphs. The way it works is when you have a graph that is stored and persisted in Kuzu, you can query it using Cypher, retrieve the subgraph that is of interest, and then directly output that into a network x graph, which essentially outputs Python objects that represent the internals of a network x graph object. Now that's extremely powerful because

the data that you had was sitting on disk in an efficient form compressed in Kuzu.

But on demand, you can run any Cypher query to retrieve a subgraph, translate that into a NetworkX format, run any NetworkX algorithm on it because NetworkX has, I think, pretty much the biggest suite of graph algorithms out there for any library. And then the moment you have the computation of that done, the output of that can be a data frame. And now you use the power of Kuzu's integration with the data frame ecosystem to rapidly ingest those graph algorithm metrics back into the Kuzu graph. So So this is a very common pattern we use for graph algorithms.

That being said, NetworkX is not the most performant method to compute on very, very large graphs. So we are actively working towards, making that a lot simpler. We already have a graph algorithm extension,

in open source Kuzu. You can check that out in the documentation page. But, basically, the goal of the graph algorithm extension in Kuzu is to allow people to directly run graph algorithms. Some of the most popular graph algorithms out there, like PageRank and UVA clustering and so on. All of these are running natively on disk in Kuzu. So the by leveraging Kuzu's internal operators, we are able to run to scale this up to really large graphs. So the goal, obviously, is to support as many graph algorithms as possible. I I believe we currently support six with a few more on the way in the next release. But for any algorithm that is not supported internally within Kuzu,

we hope to make the extension ecosystem also more easily accessible to contributors, external contributors, in a way that there's already a very active con graph algorithm community in the c plus plus world. They actually parallelize graphs at scale. We are wondering if maybe some of those contributors may be interested in plugging into the KooZoo ecosystem.

So so, yeah, there's a lot of, different ways we could approach this. But in a nutshell, network x and native DAF algorithms are the way to go in.

Are there any other aspects of the Kuzu project,

the applications and architectural paradigms around it, or the ecosystem

that is

growing in the overall space of graph applications that we didn't discuss yet that you'd like to cover before we close out the show?

So, I I think we've covered all the big ones, actually. The the the the way graphs are used more and more as context for these AI applications, I think that's a big one we are very actively focused on right now. Of course, not forgetting the traditional use cases of graph databases,

which is large scale computation. So we would love for users to actually use Kuzu in these diverse settings. Like, the I one use case I didn't mention, I think I forgot, is serverless settings on the cloud. So,

essentially, container based systems,

they tend to run ephemerally on, like, serverless systems like Lambda on AWS and their equivalents in other cloud services. So there are use cases we are seeing where people want to run an embedded graph database within that Lambda instance and then shut it down once the computation is done. It's kind of like an in memory,

service, but, actually, it's not because there is a memory allocated in the VM for that serverless instance. So there there's so many different ways, I think, people are using graphs on disk, in memory, on the cloud, locally.

So, yeah, I think we've covered pretty much the the wide horizon of different ways Kuzu enables different use cases here and keeping performance scalability and cost effectiveness, I think, at the forefront of the whole design.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and the Kuzu team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

Awesome. Yeah. So, again, my, because I work in drafts, my answer is geared towards this topic. I go back to the point about entity resolution

where, I feel like there's so much scope for building user friendly tools for people to do entity resolution at scale. Right now, I don't think there's any at least based on my analysis of the market, there is still room for innovation of,

on on this space as to how you can build a tool to come up with a convenient interface.

And by by interface, I mean, allowing people to bring their data from various sources, not only construct a high quality graph, but then work on these downstream aspects, removing duplicates,

identifying

missing values,

essentially getting high quality graphs to a point where they are actively providing this invaluable context to downstream use cases like RAG and everything else beyond. So, yeah, I think that there is a gap in that space. I I think there's a lot of innovation happening already. So that is one space I'm gonna be tracking very closely.

Alright. Well, thank you very much for taking the time today to join me and share the work that you and the rest of the Kuzu team are doing on bringing this technology to the market and the ecosystem. It's definitely a very interesting project, one that I've been following for a while. So I appreciate the opportunity to discuss it in detail and learn more about its internals and applications. So thank you again for your time, and I hope you have a good rest of your day.

Thank you so much, Tobias. It was great to be on the show.

Thank you for listening, and don't forget to check out our other shows. Podcast.net

covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com

with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends

and coworkers.

Data Engineering Podcast