Summary
In this episode of the Data Engineering Podcast, Professor Paul Groth from the University of Amsterdam talks about his research on knowledge graphs and data engineering. Paul shares his background in AI and data management, discussing the evolution of data provenance and lineage, as well as the challenges of data integration. He explores the impact of large language models (LLMs) on data engineering, highlighting their potential to simplify knowledge graph construction and enhance data integration. The conversation covers the evolving landscape of data architectures, managing semantics and access control, and the interplay between industry and academia in advancing data engineering practices, with Paul also sharing insights into his work with the Intelligent Data Engineering Lab (INDELab) and the importance of human-AI collaboration in data engineering pipelines.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Paul Groth about his research on knowledge graphs and data engineering
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing the focus and scope of your academic efforts?
- Given your focus on data management for machine learning as part of the INDELab, what are some of the developing trends that practitioners should be aware of?
- ML architectures / systems changing (Matteo Interlandi); GPUs for data management
- You have spent a large portion of your career working with knowledge graphs, which have largely been a niche area until recently. What are some of the notable changes in the knowledge graph ecosystem that have resulted from the introduction of LLMs?
- What are some of the other ways that you are seeing LLMs change the methods of data engineering?
- There are numerous vague and anecdotal references to the power of LLMs to unlock value from unstructured data. What are some of the realities that you are seeing in your research?
- A majority of the conversations in this podcast are focused on data engineering in the context of a business organization. What are some of the ways that management of research data is disjoint from the methods and constraints that are present in business contexts?
- What are the most interesting, innovative, or unexpected ways that you have seen LLMs used in data management?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data engineering research?
- What do you have planned for the future of your research in the context of data engineering, knowledge graphs, and AI?
- Website
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- INDELab
- Data Provenance
- Elsevier
- SIGMOD 2025
- Digital Twin
- Knowledge Graph
- WikiData
- KuzuDB
- data.world
- GraphRAG
- SPARQL
- Semantic Web
- GQL == Graph Query Language
- Cypher
- Amazon Neptune
- RDF == Resource Description Framework
- SwellDB
- FlockMTL
- DuckDB
- Matteo Interlandi
- Paolo Papotti
- Neuromorphic Computing
- Point Clouds
- Longform.ai
- BASIL DB
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy? Datafold's Migration Agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multisystem migrations, they deliver production ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they turn months long migration nightmares into week long success stories. Your host is Tobias Macey, and today I'm interviewing Paul Groth about his research on knowledge graphs and data engineering. So, Paul, can you start by introducing yourself?
[00:01:06] Paul Groth:
Yeah. Thanks for having me. So I'm a professor at the University of Amsterdam where I lead a research group on intelligent data engineering. So this is really the intersection of how we use AI systems for data engineering and the other way around how we build better data management systems for AI.
[00:01:27] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:31] Paul Groth:
Yeah. So it's kind of interesting. When I was doing my bachelor's degree, I worked at an AI institute of all things. And then afterwards, I started my PhD, and I started my PhD in distributed computing. And I was working with use cases around high performance computing, and in particular, the provenance or data provenance of the results of high performance computing systems. And in particular, at the time, there was a thing called the grid, which is like a precursor to what we call the cloud now. And then the question was how do you track data provenance across these high performance computing systems. And so I started doing things like developing data models for data provenance.
And then, as my career went along, I started getting more and more into data systems. So, when I first moved to the Netherlands, I started building the first kind of graph databases for different things. So building a large biomedical knowledge graph, what we would call a biomedical knowledge graph now, called Open PHACTS, which was integrating, I think we had, like, 20 different databases that we were trying to integrate. So kind of my journey was, okay, we need to do data provenance. Then I got fascinated by, hey, we're integrating data from multiple sources, and then started building these kind of big data integration systems.
[00:03:06] Tobias Macey:
And provenance is an interesting one to dig into because I think I first came across that specific terminology in the context of data. I think it was more on the data science side of things, but that was, I wanna say, somewhere around the time frame of 2014 or 2015. And since then, the terminology, at least within the ecosystem of data engineering within the organizational context, has been subsumed by the term lineage. And I'm wondering if you could maybe give some of your interpretation around some of the variance of nuance between those two terms and where one maybe isn't quite a complete superset of the other.
[00:03:48] Paul Groth:
Yeah. I actually think pretty much when you're talking, data provenance and data lineage, you're talking about the same thing. Right? And I think, I think, technically, oftentimes, we think about data lineage and data provenance. There's been an ongoing discussion in, like, for example, research. Are we talking about just things that are happening within your database? Right? So how do rows get transformed, then we build views on top of that. And I think in the industry session and also in the broader context of data provenance, can we track across multiple systems? Right? So a lot of the original work in data provenance and data lineage was on one side really at this core database side of the world tracking within the database, but then there were a lot of people looking at workflow systems, and being able to track those workflow systems and how experimental results, were produced.
And I think what you see now is, you know, you have a lot of focus on more of what I would call the workflow system style provenance, so being able to trace across your organization. Right? So I think we've broadened out that term, and I think what we're really interested in is, hey, I got this result. How can I trace back across my whole data estate to where the input results are? And we might use a data catalog for that. We might do some sort of structured logging for that. But I think that's really, you know, going beyond just tracking within, you know, your relational database.
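As a purely illustrative sketch of the structured-logging approach Paul mentions, cross-system lineage can start as nothing more than one event per job run that names its inputs and outputs; the event schema and file path below are made up for the example, not any particular catalog's or standard's format.

```python
# Minimal sketch of structured lineage logging across systems (hypothetical schema).
import json
import time
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class LineageEvent:
    job: str              # the pipeline step or process that ran
    inputs: List[str]     # upstream datasets (table names, URIs, file paths)
    outputs: List[str]    # downstream datasets produced
    started_at: float = field(default_factory=time.time)

def emit(event: LineageEvent, log_path: str = "lineage.jsonl") -> None:
    # Append one JSON line per run; a catalog or search index can ingest this later.
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(event)) + "\n")

emit(LineageEvent(
    job="nightly_orders_rollup",
    inputs=["warehouse.raw_orders", "s3://bucket/currency_rates.csv"],
    outputs=["warehouse.orders_daily"],
))
```

Tracing a result back is then a matter of walking these events from outputs to inputs, which is the whole-data-estate view Paul contrasts with row-level provenance inside a single database.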
[00:05:18] Tobias Macey:
And in terms of your academic focus, you mentioned that you're working with the intelligent data engineering lab. Obviously, a lot of evolution of data engineering happens rather organically in a large part, at least from my understanding and experience, has been very industry driven. And so I'm curious if you can give some of the details in terms of the areas of focus and some of the ways that data engineering can be conceived of as an academic pursuit and some of the interesting insights that you're able to gain by virtue of your research that are maybe translatable into some of the day to day of engineers working in that organizational context?
[00:05:58] Paul Groth:
Yeah. So, actually, it's a pretty funny story about how I, well, not a funny story, but a story about how I started INDELab. So I was an academic. I worked at the Free University Amsterdam, and then I actually decided to go work at Elsevier, which is a large publisher and information content company, and I was working in their research lab. And at Elsevier, I had a really fantastic time, and it gave me really good insights into the practical day to day of managing large scale datasets. What are the problems that you see: information silos, the problems of semantics, getting to an agreement in an organization?
And that's one of the reasons, actually: after a couple years there, I just kept seeing these kind of fundamental problems that we continually have. And could we take a step back and go back and think about research and think about some of those fundamental problems over the long horizon? And in particular for me, these ideas of how do we do data integration better? How do we work? Do we have some, like, primary mechanisms to work with messy data? Can I tell you rigorously, here's how you work with messy data? Here's your data integration setup. Can I do that? Can I have a solution that I can say will work all the time? Can I have some rules about how to do that? So this is one of the reasons I kind of went back into academia, because of that motivating factor of all these problems that we kept seeing within the corporate context.
I think, in general, actually, there's a bigger conversation between, you know, academia and industry in the field of data engineering or data management. Right? So, I mean, I think if you're working with a SQL database, that's coming from a lot of research, you know, going back to the papers from Codd at IBM. But, also, you see this conversation between academics and industry happening a lot at places like SIGMOD. So I think sometimes you see the motivations coming from industry problems, and then, you know, academics taking a step back and having a way to formalize those or come up with efficient ways to do that. Like, I think maybe you've had on this show, for example, the emergence of columnar databases would be an example of that kind of thing.
Specific to INDELab, right, what we're trying to do is work on a number of different thematic areas. One is how do we automatically build databases or automatically build knowledge graphs from multiple sources. Right? So can I come to you and say, okay, given heterogeneous data, can I construct you a super high quality dataset that's super useful? Another place that we are working on is what we call context aware data systems. So this is things like, how does your data management system cope with different users, different environments, changes in the dynamics around it? So you can think of data systems that are designed for digital twins. And lastly, like, I'm very compelled by data scientists and data workers.
And so how can we design data management systems for helping with machine learning? So, for example, how can we do deeper data quality assessments? Right? So if you're building some data unit tests, can we tell you, you know, how that would work? Or, like, a recent paper we had was looking at the impact of data handling on machine learning models. So how do you change your datasets when you're doing data prep? How does that impact your machine learning models? And it's kind of surprising: the impact is bigger than you would think. Right? So your machine learning models can be very impacted by your data handling, and sometimes not in the ways you would expect. Those are some of the areas we're working on, and I think they have messages for people in practice.
[00:10:13] Tobias Macey:
There are a lot of interesting things there to unpack, in particular that context aware data systems piece. One of the things that immediately came to mind is the challenge of managing the nuance of access control in data warehouse environments, etcetera, where a lot of the suggested best practice is to use things like attribute based access controls. But then the challenge is, okay, well, how do you gain access to the appropriate attributes? How do you determine that mapping? And how does that propagate across a constantly shifting set of data models, where in the warehouse context, obviously, you want to make sure that you have consistent naming and consistent tagging, but there's typically not a broad guarantee that those are all in place in the way that you want them to be? And so there are numerous points at which you can accidentally leak access because you don't have the right attributes or because you haven't explicitly added the appropriate row level permissions or column level permissions. And I'm wondering what your thoughts are, and whether there have been any areas of your research that touch on some of those complexities of identity and access control and managing the contextual elements of who should be able to do what and for what reasons.
[00:11:28] Paul Groth:
Yeah. So we haven't worked a lot on access control per se, but what we do see is exactly what you see: this divergence in what I would call semantics. Right? So what do we call different attributes in different spaces? You see this all the time with things like the idea of customer versus person, right, or customer versus user. Right? Is your user your customer? What's the role there? Or are you a sysadmin, or are you an administrator? Are you a steward? What are all those called? And, essentially, the proliferation of data models is what it is. And we have these kind of classic things where everybody has their own data model, and we go to the data lakes view of the world. And then we have different conversations about, oh, we need to data warehouse it so that we all agree on things, but then nobody ever really does. So I think this is where the future is going: much more, I can adapt to any underlying data model for the application.
So I can tell you, if we're talking about the security case, actually, what you mean in this context is this person with this access control view. Right? And I think this is where a lot of where we're gonna go in data management systems is this adaptability, and that's maybe something we'll talk about later, like the use of AI to help you do that adaptation, because it is very hard, even in an organizational context, to enforce data models. I don't know if you've had that experience in your environment or you've seen that in other conversations you've had. But data model enforcement, especially across different environments, is hard. Right?
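To ground the attribute-based access control concern Tobias raises, here is a minimal, hypothetical sketch of a default-deny check: a column is only released when the caller's attributes satisfy the tag attached to it, so an untagged or mis-tagged column fails closed rather than leaking. The policy keys and attribute names are invented for illustration, not drawn from any real system.

```python
# Illustrative attribute-based access control check (default deny).
from typing import Dict, Set

# Hypothetical column-level tags: which attribute grants access to which column.
COLUMN_POLICY: Dict[str, Set[str]] = {
    "customer.email": {"pii_reader"},
    "customer.country": {"analyst", "pii_reader"},
    # "customer.ssn" is intentionally untagged -> nobody gets it by default.
}

def allowed_columns(user_attrs: Set[str], requested: Set[str]) -> Set[str]:
    granted = set()
    for col in requested:
        required = COLUMN_POLICY.get(col)
        # Fail closed: a column with no policy entry is never released.
        if required and required & user_attrs:
            granted.add(col)
    return granted

print(allowed_columns({"analyst"}, {"customer.email", "customer.country", "customer.ssn"}))
# -> {'customer.country'}
```

The default-deny stance is the design choice that matters here: a column that never got tagged during a schema change is withheld instead of silently exposed.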
[00:13:11] Tobias Macey:
Yeah. One of the common refrains in that regard is the conversation around master data management, or ensuring that you do some of the entity disambiguation to make sure that the person that you're talking about in this table is the same as the person that you're talking about in that table, or the widget, or what have you. And so then a lot of times that brings up the conversation around, oh, well, just turn it into a knowledge graph, with emphasis on "just" being a gross oversimplification of the effort involved.
And so, yeah, the the adherence to specific domain models and specific semantics around those models is definitely, I think, one of the most long running and fraught challenges in data engineering as an exercise.
[00:14:00] Paul Groth:
Exactly. And I think, like, one of the things I talk about when we talk about knowledge graphs is I always say these are great quality datasets. I don't know if that's actually true. But, like, one of the things you like about them is, oh, everything has a unique ID, and, you know, we know exactly what we point to: the Data Engineering Podcast. It has the URL. It's uniquely identified, and we can describe it as a podcast. But if you actually look at any knowledge graph, you go out and look at Wikidata, there's still a lot of undisambiguated things. And even in knowledge graphs that are, you know, really rigorously maintained, you always have this proliferation of identity.
And in data management, we're just constantly trying to figure that out. So I think in some sense, you have to have this conversation in your organization about what are we gonna actually govern. Right? What are we gonna enforce, and what are we gonna let be willy nilly? And I think, in general, the question is, can we design some newer techniques that make it easier to do those kinds of mappings?
[00:15:04] Tobias Macey:
Digging a bit more into knowledge graphs from looking at some of your publications and some of your profile information available on the web, it shows that you've spent a significant portion of your career invested in that particular niche area. It's a fairly large niche, but in the broad context, it is something that has typically been constrained to a group of people who are very enthusiastic about it. And then there is a long tail of people who are interested but don't take the time to actually invest in it. And in the past couple of years in particular, there has been a huge growth and interest in knowledge graphs within the context of large language models, both for purposes of using the graph as a means of grounding and refining the context for the models, as well as the ability of those models to be used for actually constructing the graphs out of messy, unstructured data as a means of very rapidly bootstrapping a knowledge graph effort.
But then there's also still the long tail of cleaning and curating and pruning that graph. And so I'm wondering if you can talk to some of the ways that you're seeing the overall industry adoption, maybe retread ground that is already well known of certain dead ends, etcetera, or maybe some of the ways that this renewed interest is reinvigorating that overall ecosystem of knowledge graphs as an area of both academic and industry pursuit?
[00:16:29] Paul Groth:
Yeah. So I think, like, you know, it's been pretty exciting, the introduction of LLMs just for the construction of knowledge graphs. Right? So one of the biggest problems in building knowledge graphs, usually what you were doing, you know, ten years ago, is you were taking relational databases and you were converting them into graphs. And you were writing some sort of mapping language into some sort of graph structure, whether it be RDF or Cypher or whatever modeling language you wanted, but you were writing mapping rules. Or a big area of interest was using natural language processing to do information extraction, named entity resolution, and relation extraction, those kinds of things, but you were building pretty complicated pipelines to get that done. And what we've seen is really that it's just so much easier to construct those information extraction pipelines.
It's actually even easier to construct mappings. Right? So if you have underlying relational databases, it's pretty easy to get models to create mappings for you to a graph. And I think that's super exciting. Right? It makes that process of building a graph a lot easier, where I think a lot of people got kind of hung up on it before. Right? Because it was a big ask. Right? So when I was at Elsevier, we built knowledge graphs. I was, you know, one of the people who was building some of those first knowledge graphs. And we still have projects going on with them, and you see it's much, much easier for them to now construct those models and do the integration. And what that means, I think, is that we can start talking about, you know, what are some of those benefits from creating a graph. And I don't necessarily think it's having a graph database, necessarily.
Although you just had a great talk with the guys from KuzuDB, which I really enjoyed. But, right, I think the recognition is, when you're building that graph, you start talking about defining those semantics that we were just talking about. Right? What do we mean by person? What do we mean by podcast? What do we mean by episode? What do we mean by customer? And writing those down and making those explicit. And, you know, I don't know if you know Juan Sequeda from data.world / ServiceNow.
And, you know, we've written things together, and he, in particular, has been on this journey of kind of emphasizing the role that you wanna play in getting that agreement. Right? So that focus there. So I think that's been very strong. Another really interesting change with LLMs is that we don't have to put everything in the graph. I think there was this idea maybe a couple years ago that if I was gonna do a knowledge graph, everything had to be in the knowledge graph. Right? I had to convert all my data into a knowledge graph. And that was never the case, but this was, like, the central dogma that you kind of heard. And now what we've seen is this idea that, hey, you could put part of your data in a graph where it makes sense, and we could connect out. We could connect out with links to the underlying datasets. We can connect out with queries.
We can just leave data as literals, so actually blobs in the graph itself, and we just leave the data there. And that's actually okay. And we can take advantage of the fact that we have LLMs that are able to extract meaning from those literals without having to construct everything. And I think this ease of construction also, you know, leads to things like, hey, actually, we can build a graph on the fly if that's useful for my downstream task. And this is where you've seen things like Microsoft's GraphRAG. Right? So what they do is they construct a graph because it's helpful when we go out to prompt a language model. So I think those are some of the, you know, the places where we see, okay, maybe we're in the trough of, I don't know, whatever these troughs are that people talk about, but, like, we're in a place where people are saying, okay, this is where a knowledge graph is useful in my whole data engineering pipeline, in my whole data management system. It's not an all or nothing proposition.
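As a sketch of the "LLMs make construction easier" point, the usual pattern is: hand the model a chunk of text (or a table schema), ask for triples or mapping rules in a structured format, then validate before loading anything into the graph. The call_llm stub below stands in for whatever model API you use; the prompt and output format are assumptions for illustration, not a specific tool's interface.

```python
# Illustrative LLM-assisted triple extraction; call_llm is a placeholder for your model API.
import json
from typing import List, Tuple

def call_llm(prompt: str) -> str:
    """Stub: replace with a real call to your LLM provider or local model."""
    raise NotImplementedError

def extract_triples(text: str) -> List[Tuple[str, str, str]]:
    prompt = (
        "Extract (subject, predicate, object) triples from the text below. "
        "Return only a JSON list of 3-element lists.\n\n" + text
    )
    raw = call_llm(prompt)
    triples = json.loads(raw)  # verify the model actually returned JSON before loading
    return [tuple(t) for t in triples if len(t) == 3]

# Usage idea: extract_triples("The Data Engineering Podcast is hosted by Tobias Macey.")
# might yield [("Data Engineering Podcast", "hostedBy", "Tobias Macey")] -- results still
# need the cleaning and curation pass discussed above before they enter the graph.
```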
[00:20:51] Tobias Macey:
I think one of the elements of the knowledge graph ecosystem that maybe hampered its growth is the lack of unification around how to properly model and represent those graphs and then query them, where there was the evolution from, I think the earliest one was maybe OWL, and then SPARQL gained a lot of ground in the semantic web era, which a lot of people are still very interested in and devoted to. And then Neo4j came out and popularized the property graph model, which simplified the overall means of constructing and interacting with the graphs, but there was still no standardized means of actually querying them. So you had Cypher from Neo4j. You had Gremlin as sort of the open approach. Now we have GQL as the standard track syntax that is largely modeled after Cypher, but then you also have things such as Dgraph that leaned in on GraphQL as a query interface. And I'm wondering what your thoughts are on some of the ways that that lack of unification and lack of, I guess, settled best practice within that ecosystem has maybe hamstrung its adoption and growth, at least up until now.
[00:22:01] Paul Groth:
Yeah. So I think one of the banes of my existence is this kind of coupling between the principles of a knowledge graph and the underlying technology stacks you might need. So I think you can build a knowledge graph in a relational database. No problem. Right? There are some principles there. Okay, we're gonna have unique identifiers. We're gonna have relations, connections, links between them. We're gonna have types. Right? Those are useful concepts to have, and we do that all the time. If you stood up at a whiteboard, you'd draw nodes and edges and you'd draw attributes. It's kind of a useful thing to do. And now we can use different technology stacks to implement those. And this is something that I always try to convey to people. Maybe I need to do a better job or need to do more outreach. What I have seen, though, is, like, I'm very excited about things like Amazon Neptune, where what you see is, hey, if you wanna query in Cypher or you wanna have a property graph view of the world, or if you wanna have more of an RDF or semantic web or triple view of the world, hey, you can have that over the same sort of graph data that's there. Right? So almost this independence from that. And that's where things like, you know, GQL are interesting, where you have your relational database, and then, hey, if you wanna treat that as a graph, here's how you do it. Here are the nodes and here are the edges, technically, in that language.
And so for me, that's kind of the exciting point, because I think people are beginning, and maybe this is the result of all this proliferation, right, of different technology stacks, to understand the underlying importance of the concepts. That's okay. And, okay, now we're debating what's your favorite syntax, what's your fastest database, what is easiest to install, what can I get on my cloud? Those, for me, are maybe a good sign for the future, right, where you see vendors that are very much around developer experience looking to use these concepts, but maybe in a way that's better built into what you wanna use as a developer.
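Paul's point that the principles (unique identifiers, typed nodes, explicit relations) matter more than the stack can be made concrete with nothing but SQLite; the two-table layout below is a minimal sketch invented for illustration, not any particular product's schema.

```python
# Knowledge-graph principles on a plain relational engine: unique IDs, types, edges.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nodes (id TEXT PRIMARY KEY, type TEXT, label TEXT);
CREATE TABLE edges (src TEXT, predicate TEXT, dst TEXT,
                    FOREIGN KEY (src) REFERENCES nodes(id),
                    FOREIGN KEY (dst) REFERENCES nodes(id));
""")
con.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    ("podcast:dep", "Podcast", "Data Engineering Podcast"),
    ("person:tobias", "Person", "Tobias Macey"),
])
con.execute("INSERT INTO edges VALUES (?, ?, ?)", ("podcast:dep", "hostedBy", "person:tobias"))

# One-hop traversal ("who hosts this podcast?") is just a join.
for row in con.execute("""
    SELECT n.label FROM edges e JOIN nodes n ON n.id = e.dst
    WHERE e.src = 'podcast:dep' AND e.predicate = 'hostedBy'
"""):
    print(row[0])
```

The same nodes-and-edges shape could be loaded into a property graph store or exposed as triples later; the identifiers and types are what carry over.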
[00:24:15] Tobias Macey:
Moving back over to LLMs and the impact that they're having on the practice of interacting with data and the purpose of data in a lot of cases, I'm curious what you're seeing in your areas of research or some of the particularly interesting or novel applications of LLMs to that data engineering context or ways that you're seeing it shift both the academic and industrial practices around data.
[00:24:42] Paul Groth:
Alright. So I think there's a lot. One place, I think, is this idea of multimodal data becoming actually integrated into your database. Right? So, essentially, having database operators that are LLMs. And here, I'd point to something like Immanuel Trummer's work on SwellDB, or there's an extension to DuckDB called FlockMTL. And what these do is, essentially, in your database query, you're writing your SQL query, and you can just write what looks like a user defined function that's calling out to the LLM. And what's cool about that is you can have essentially the kind of declarative SQL style queries, but now you can do that over unstructured data. So kind of multimodal data really becoming a first class citizen, text, images directly in your database. And I think that might help us a lot, because oftentimes we've treated, you know, multimodal data, images, or videos as completely separate: you put that in your S3 bucket, you have a link out to that, you load it in. And now you can just store that in your database, and you can do operations on it without necessarily pushing that to, you know, a separate pipeline. And I think that's a pretty interesting, exciting thing.
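The actual SwellDB and FlockMTL interfaces differ from this, but the shape of the idea — a declarative query whose projection calls out to a model — can be sketched with DuckDB's Python UDF registration and a stubbed model call; everything here (function name, table, fake "summary") is an assumption for illustration.

```python
# Sketch: an LLM as a scalar operator inside a SQL query (DuckDB Python UDF).
# The model call is stubbed; swap in your provider. Not FlockMTL's or SwellDB's actual API.
import duckdb
from duckdb.typing import VARCHAR

def llm_summarize(text: str) -> str:
    # Placeholder for a real LLM call; here we just fake a "summary".
    return text[:40] + "..."

con = duckdb.connect()
con.create_function("llm_summarize", llm_summarize, [VARCHAR], VARCHAR)

con.execute("CREATE TABLE reviews (id INTEGER, body VARCHAR)")
con.execute("INSERT INTO reviews VALUES (1, 'The onboarding flow was confusing but support resolved it quickly.')")

# Unstructured text handled declaratively, alongside ordinary relational columns.
print(con.execute("SELECT id, llm_summarize(body) AS gist FROM reviews").fetchall())
```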
Another thing, I think more out there, is this idea of large language models as databases themselves. Right? So there's work by one of my colleagues in INDELab, Jan-Christoph Kalo, and Paolo Papotti over at EURECOM, and also, again, Immanuel Trummer, where, essentially, you treat the language model as a database. And everybody's familiar with this: language models know things. They have facts and data about the world. Do we necessarily need to build an actual database? Right? Can we just get our data directly from the language model? Right? And that actually solves some of the problems of data ingestion. Right? Because, essentially, your pretraining is almost your data ingestion, if you think about it that way. Now there's lots of interesting ramifications of that. Right? Can you trust the facts coming back from your LLM?
Do you wanna treat it as that? Right? What's the trade off for, you know, keeping things in your LLM parameters versus actually using a traditional database or even looking at unstructured content? Right? So I think that's kind of a very exciting research angle. And lastly, I don't think it's necessarily to do with LLMs per se, but it's the ramifications of training LLMs and the kinds of architectures that we're building. So, essentially, all the major clouds, everybody's investing massive amounts of resources in building data centers designed for ML systems, right, designed for training large language models.
And that means the underlying systems that we have, the system architectures that we have, are changing. And so the data management systems that we have are changing to take advantage of those underlying changes in computer architecture. So here I would point to Matteo Interlandi. He's at Microsoft Research, and he's done work on essentially building a database on top of GPUs by using PyTorch. So you have your SQL operators, your joins, your projections, your selections, and what they are is PyTorch functions that you then compile down, and you get these massive boosts in performance.
Why? Because you have all these GPUs. Right? People are buying loads of GPUs to put in these data centers. So I think this is something pretty exciting in terms of, hey, maybe we change the underlying architecture of our database to take advantage of all of the investment around AI hardware.
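The toy sketch below only shows the flavor of the relational-operators-as-tensor-programs idea Paul describes — a selection as a boolean mask and a group-by sum as a weighted bincount — and is not the API of any real system; moving the tensors to a GPU is the one-line change noted in the comment.

```python
# Toy version of relational operators as tensor ops (not any real system's API).
import torch

# A tiny "orders" table stored column-wise as tensors.
customer_id = torch.tensor([0, 1, 0, 2, 1])
amount      = torch.tensor([10.0, 25.0, 7.5, 40.0, 5.0])

# SELECT ... WHERE amount > 8   -> a boolean mask (selection).
mask = amount > 8.0
filtered_ids, filtered_amounts = customer_id[mask], amount[mask]

# SELECT customer_id, SUM(amount) GROUP BY customer_id -> weighted bincount (aggregation).
totals = torch.bincount(filtered_ids, weights=filtered_amounts)
print(totals)  # tensor([10., 25., 40.])

# On a machine with a GPU, the same plan runs there by moving the tensors:
# customer_id, amount = customer_id.cuda(), amount.cuda()
```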
[00:28:37] Tobias Macey:
Yeah. That's definitely an interesting aspect as well, where we have had GPU powered databases, again, as sort of a niche effort for specific use cases, but they've typically been very expensive because they need a GPU, which is not cheap to operate. And another area of computer architecture evolution is the idea of neuromorphic computing, where you're trying to change from just being a straight Von Neumann architecture to using interconnects and architectures that are more akin to the way that the human brain operates, as far as the highly parallelized, highly connected means of data interchange. Some of the, I guess, more pedestrian ways of thinking about it are some of the work that AMD is doing, where they're collocating their CPU on the same chip as the GPU to allow for higher parallelism and data interchange between them. There's a lot of work going into the network stack to be able to do direct networking from the CPU to the GPU to cut out a lot of the bus transfer, etcetera. And I'm wondering, what are some of the conversations that you've had around how that maybe changes the ways that we even think about the role of a database in that context and the shape of what it can be or should be?
[00:29:51] Paul Groth:
Well, I think what it means is, like, I mean, the conversations are, okay, we can put everything in, well, we can put everything in memory, or we can drive a lot of things into memory. Right? I mean, if you're working with a MacBook, right, we all have massive amounts of memory that's available both to your GPU and to your CPU, and you can actually load your databases completely in memory, right, for the most part. Right? So, in fact, this was the thought behind, you know, the folks that live across the street at CWI in the creation of DuckDB: look, actually, you can have columnar databases that look like SQLite, but because of the changes in computer architecture, we just have so much more RAM that we can actually deal with. And as you said, this kind of idea of, hey, we have so much memory bandwidth.
We can use these different kinds of chips, or different cores on your processor, to do different data management functions. And I think what that means for us, you know, if you're a data management practitioner, is maybe you don't have to be in a cloud database right away. Right? So maybe that's the kind of message. Right? Or maybe we can have pipelines where we can just autoconstruct a database right away, put it up really quickly, do some things to it, shut it down, and maybe shift it across the network. Right? So I think that's what I mean when we have to think about the architectures that are available. Right? So that changes how we do this kind of data management practice.
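The "spin a database up right away, do some things to it, shut it down" pattern is roughly the following; the Parquet file name is a placeholder for whatever local file you happen to have, and everything runs in the process rather than against a server.

```python
# In-process, in-memory analytics with DuckDB: no server, nothing to provision.
import duckdb

con = duckdb.connect()  # purely in-memory; pass a filename instead to persist

# Query a local file directly (placeholder path), aggregate, and throw the engine away.
con.execute("""
    CREATE TABLE events AS
    SELECT * FROM read_parquet('events.parquet')  -- hypothetical file
""")
print(con.execute("SELECT count(*) FROM events").fetchone())

con.close()  # nothing left running; the result (or a copied file) can move elsewhere
```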
[00:31:31] Tobias Macey:
Absolutely. Yeah. It brings up one of the sort of tribal knowledge elements in the data industry, the idea of data gravity, where, oh, you just wanna ship your compute to your data because it's too expensive to move the data anywhere else. With the introduction of things like DuckDB and KuzuDB, these very lightweight, easy to construct and throw away data stores for being able to do high performance compute and querying, there has been a massive sea change in terms of how people think about their data architecture, where maybe they do still have a massive central repository of data, but not all of the computation happens across that repository.
And so there has been a massive growth in edge compute, or moving data to the edge for operation, because then it reduces round trip latencies. And that's, again, something that people have been doing for a while, at least in some small subset of cases, but it was generally a more specialized thing of, oh, I need to optimize this specific thing, and so I'm doing this edge computing to deal with this one case. But now it's becoming a more generalized pattern because the tooling and capabilities have expanded to incorporate that. And the idea of data federation with these lakehouse architectures as well means that the data doesn't even necessarily need to be situated in one location if you can massively parallelize the access of that underlying data, being able to do more of the push down of the query pruning to say, oh, I actually only need to select this subset of data. I don't need to scan absolutely every file for everything. A lot of those optimizations, I think, are definitely very interesting and changing the ways that we manage our data architectures, especially as the scope of interaction with those datasets grows as well, where largely it was very centralized. Leaving out the sort of hyperscalers of, you know, Facebook, Google, Amazon, etcetera, any given company would largely have a fairly geographically constrained audience, and so they didn't necessarily need to optimize for federation.
But as the availability of Internet, the availability of compute in multiple locations, obviously, with issues around data sovereignty, that changes the ways that we even need to be thinking about architecting our systems for various reasons. And so I think that's another motivation for broadening the architectural principles of how we think about what data operations need to be done where and by whom.
[00:33:59] Paul Groth:
Exactly. Right. And, you know, I'm in Europe. So, right, we think a lot about, okay, do you have data liability? Do I even want this dataset? Right? Is it better to have it on the person's local device, and maybe I'm sourcing data from somewhere else? And, actually, given the power of, you know, edge devices and the ability for us to have things like running Wasm, right, I can compile things to run in a browser, run them really fast. I can pull down necessary data from a central place, but I can leave personal data with the person. So I think those are the kinds of architectures that we're gonna see more and more of, as you said. And I think it's kind of exciting given the capacity of the edge devices.
[00:34:40] Tobias Macey:
And then another aspect that we've touched on a little bit is this juxtaposition of data engineering in the context of an organization, where I'm focused on the needs of a particular business and the data assets that they need to be able to introspect their operations or provide services to their customers, and then the data management approaches and requirements in research and academia, where you're largely focused on either creating or curating datasets, and then a lot of the data management effort goes into how do I make sure that this data asset is reproducible for use by other people, either for derivative research or for being able to replicate my research. And I'm wondering if you can give some of your perspective on what you see as the various areas of overlap and disjoint practices across those two arenas.
[00:35:36] Paul Groth:
Yeah. I think the biggest thing is that we don't have any top downness in research contexts. Right? So in a business context, you may not think that you have top downness, but you have a little bit. We can come up with data models. Maybe we have lots of them, but we can come up with data models that we agree on. Right? Getting agreement on data models in research is very hard. People try to do it. Right? There are standards bodies that try to do this in research, building biomedical ontologies, for example. But it's a much more difficult process, and a process that we don't often even do in research. Right? So in some sense, one of the nice things in business is you do have a customer. Right? You either have an internal customer who you're trying to design a dataset for, or you have, you know, the end customer of the business, where you can think about, okay, how am I designing my data management system for serving that? Whereas in research data, we have these kind of more broad ideas of, hey, I wanna put this dataset online, and hopefully somebody will reuse it. Right? And there's a big conversation about, hey, how much should you invest in actually creating all the metadata so that somebody could potentially, in the future, pick up your data and reuse it? And so one of the things I'm really excited about, coming back to our conversation about LLMs, is, can we use LLMs to help us create metadata on the fly in research so that new users can come in and actually understand your data and maybe use it for their domain? So an interesting example is, I was just talking to a researcher in construction materials.
Right? And she was talking about, okay, I have a dataset that is really focused on the material properties of this particular kind of concrete. Actually, that dataset might be useful for another scientist looking at life cycle optimizations of construction. Now those two scientists have different vocabularies. Can we use an LLM to bridge that gap for that scientist? Now those kinds of conversations, I think you have a little bit in business, but in research, it's doubly so. Right? The vocabularies are even farther apart. Right? The semantics are even farther apart. And so bridging those is, I think, one of the biggest challenges that we have in doing research data management. That brings up an interesting conversation I had. I forget exactly with whom it was,
[00:38:07] Tobias Macey:
but the overall gist of it was that the introduction of LLMs as a utility in the process of data management almost necessitates the inclusion of more external datasets within a business context, because you have the ability to do that. And by virtue of using external third party datasets, you are able to enrich the decision making that you're doing because you have a much broader context, a lot more information than just whatever first party data you're able to generate or collect. And I think that's an interesting parallel of, within research, being able to bridge across research domains because of the variance in the vocabularies used. And I think that also is applicable in that external dataset use case, where whoever created or curated that dataset, or whoever is selling that dataset, has a particular set of vocabularies.
They're probably targeting a particular set of industries, and so maybe that constrains who their focused target audience might be. But because LLMs can do some of that semantic mapping for you to translate into your specific business domain and vocabularies, it broadens the set of external assets that you might be able to incorporate and actually derive value from.
[00:39:25] Paul Groth:
Exactly. And I think one of the interesting knock on effects of exactly that is, a little bit coming back to the idea of large language models as a database. Right? So I think then you start to think about, oh, do we need to actually manage our large language models, and the access to large language models, as actually a data asset? Right? So are we just using it for its capabilities, or are we actually using the information in the LLM? And if I'm doing that, then maybe I need to start doing things like data versioning, having the right metadata about it, knowing if we have the right licenses, figuring out the data lineage that actually goes, not that we just use this component, but that we're actually sourcing information from that component. And I think this is something that, if you're in this space, you're gonna have to think about as a data management practitioner.
[00:40:18] Tobias Macey:
Yeah. One of the interesting reductive summaries of large language models that I've come across is the idea that they are effectively a very sophisticated lossy compression algorithm. And so in your experience of working in this space of academic research on data management, with this focus on machine learning use cases and the growing bidirectionality of that, what are some of the most interesting or innovative or unexpected ways that you have either seen LLMs applied in that context of data management, as either a receiving end or a producer, or just some of the interesting areas of research that either you or some of your colleagues are focused on that you think are worth highlighting for the audience?
[00:41:06] Paul Groth:
I think, for example, having point cloud databases. Right? Actually, I was just recently at SIGMOD. They had, essentially, you now can use point clouds and do, like, full structured queries across point clouds. Right? And also using these kinds of techniques, LLM techniques. Or I saw one streaming database where they're taking data coming in from a health care situation, where you have sensor data coming in, and they had health care records of the person, and they were streaming that live and then using LLMs as operators to work over that dataset. So it really comes back to this kind of, I've embedded the LLMs into the operators of the database. I think that's been, like, very cool for me.
[00:41:56] Tobias Macey:
And in your experience of working in this space and doing this academic research on data management and its intersection with these ML use cases, what are some of the most interesting or unexpected or challenging lessons
[00:42:12] Paul Groth:
that you've learned personally? Yeah. So I think the number one thing that I think I learned in industry, but I keep learning every day, is that real data is always surprising. Right? The particular properties of real workloads, figuring that out. So I'll give you a couple examples. Right? So just recently, one of my PhD students, David Jackson, and doctor Hazar Hamuch, we built a knowledge graph, a database on bioactive compounds from the literature. So it's called BaselDB. And it was just very interesting to see, like, how amorphous the data is around what constitutes bioactivity, and how we can, you know, take data from publications and kind of translate it into a core database.
We've worked with another PhD student of mine, Trissel Libertore. She actually built what we call FashionDB, which is a knowledge graph of fashion data. Right? So actually using LLMs to extract things like the context of fashion, how it was used, the different seasons, connecting that together, because we're using that for innovation studies in fashion. But just that real world data, every time we look at real data, it's always a mess, number one. And it's always super interesting, because people just don't do what you tell them to do in Database 101. So I think that's always surprising. It's also one of the reasons, like, I really like working with companies and real organizations and building real datasets ourselves: you get that experience of what people are actually doing when they create datasets. Right? So the Excel spreadsheet problem just persists, and it persists at every scale. Right? So that's always super challenging, but super fun.
[00:44:08] Tobias Macey:
And as you continue to invest in researching this constantly shifting ecosystem, what are some of the areas of focus that you are interested in for the near to medium term?
[00:44:21] Paul Groth:
Right. So one of the big things is, we were talking a little bit about large language models as databases. How can we constrain the information coming out of those, and how can we make sure that it's factual, to more or less a degree? We have a recent publication that will come out at a conference on neurosymbolic AI later this year that's exactly about this. Right? So how can we make sure that the facts that we get out of large language models are facts that we can use, or at least that we can give some evidence for or against? So I'm excited about those.
We talked in the very beginning about this idea of flexible data integration. So the idea of, hey, I have a completely new data model. What does your underlying data estate look like? Can I automatically populate that new data model? And can we really drive down the cost of, yeah, I have a new view of the world, I have a new set of semantics, I have a new data model, can we auto populate that? And can we have confidence in that auto population? So automated generation of mappings, automated information extraction. That I'm super excited about. And lastly, coming back to the idea that, you know, people are surprising and real world data is surprising, I think our data engineering pipelines have always had some human in the loop. We've called them annotators, or we've called them the crowd, or we've called them experts back in the day. But I think this idea of data engineering pipelines as the combination of humans, AI, and technical systems together only becomes more important as we kind of move up the stack with our data engineering pipelines. So those are some of the areas that, in the kind of medium term, we've been looking at and are pretty excited about.
[00:46:20] Tobias Macey:
Are there any other aspects of your areas of research focus or the overall impact of LLMs on the data engineering ecosystem or any of the other myriad topics that we touched on that we didn't discuss yet that you'd like to cover before we close out the show?
[00:46:36] Paul Groth:
No. I think that was a really great conversation. I really enjoyed it.
[00:46:41] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:46:56] Paul Groth:
I think the biggest gap right now is the fact that it's difficult to choose one technology stack. I mean, it's not tooling, but, we didn't talk about it, as a side project I have a start up company called Longform.ai. And one of the interesting things about that has been how do we choose the right technology stack. And I think this is really, really challenging. I think there's a lot of taste. It keeps moving. This is one of the reasons why I enjoy your podcast so much, Tobias, is that, you know, it keeps me up to date on the changing nature of data management. But helping people really choose a technology stack that's right for them is exceedingly difficult. We talked about that with respect to knowledge graphs, but you see it in everything from workflows to which cloud service I should use. I think having some way to do that, maybe we'll never get there, but I think this is the biggest challenge. Which technology stack should I use for my problem?
[00:48:01] Tobias Macey:
Yeah. It's definitely a constantly moving target. And for a certain period in the late nineties, early two thousands, you had very vertically integrated providers where it was, oh, well, you just go and use Informatica or pick your provider. And then in the late twenty tens into the beginning of twenty twenties, we had the growth of the, quote, unquote, modern data stack of, oh, just pick whatever you want, and then you just cobble them all together. It'll be fine. And then everybody realized, actually, that's a huge amount of work and really painful to deal with. And so I think now we're starting to move back into another area of consolidation where people are composing their own opinionated, vertically integrated stacks out of a grab bag of different technologies to say, we know that this is really painful. We're just going to do this part for you. Just buy our product, and it'll be amazing until you start to hit against the boundaries of that.
[00:49:02] Paul Groth:
So I have one last thing if we have a little time. Yeah. And I don't know where you put this, but I have a request for your listeners. So we teach a lot of bachelor students in data management. If your listeners could tell us one thing that we should teach our students coming out of a bachelor's computer science curriculum, that is something I would love to know; if they send me an email, that would be great. This is always interesting, and I'd love to hear from any of your listeners who have thoughts or opinions on that. Absolutely. And, yeah, so for anybody who does have opinions that they want to
[00:49:37] Tobias Macey:
send them along to Paul; his contact information is in the show notes. So with that, I thank you for taking the time today to join me and share your thoughts and experiences on the areas of research that you're focused on, as well as pontificating on the overall ecosystem. It's been a very enjoyable conversation. I appreciate the time and energy that you and your group are putting into helping to gain more insight into this constantly shifting space. So thank you again, and I hope you have a good rest of your day. Yeah. Thanks a lot, Tobias. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy? DataFold's migration agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure a perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to DBT, or handling complex multisystem migrations, they deliver production ready code with a guaranteed time line and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafull to book a demo and see how they turn months long migration nightmares into week long success stories. Your host is Tobias Macy, and today I'm interviewing Paul Groth about his research on knowledge graphs and data engineering. So, Paul, can you start by introducing yourself?
[00:01:06] Paul Groth:
Yeah. Thanks for having me. So I'm a professor at the University of Amsterdam where I lead a research group on intelligent data engineering. So this is really the intersection of how we use AI systems for data engineering and the other way around how we build better data management systems for AI.
[00:01:27] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:31] Paul Groth:
Yeah. So it's kind of interesting. I, I I when I was doing my bachelor's degree, I worked at a AI institute of all things. And then afterwards, I started my PhD, and I started my PhD in distributed computing. And I was working with use cases around high performance computing, and in particular, their provenance or data provenance of the results of high performance computing systems. And in particular, at the time, there was a thing called the grid, which is like a precursor to what we call the cloud now. And then the questions was how you do you track data provenance across these high performance computing systems. And so I started doing things like developing data models for data provenance.
And then, as the career went along, I started getting more and more into data systems. So, when I first moved to the The Netherlands, I started building the first kind of graph databases from different things. So building a large biomedical knowledge graph, what we call biomedical knowledge graph now, called OpenFAX, which was integrating I think we had, like, 20 different databases that we were trying to integrate. So kind of my journey was, okay. We need to do data provenance. Then I got fascinated by, hey. We're integrating these data system data from multiple sources and then started building these kind of big data integration systems.
[00:03:06] Tobias Macey:
And provenance is an interesting one to dig into because I think I first came across that specific terminology in the context of data on the data science side of things, and that was, I wanna say, somewhere around the time frame of 2014 or 2015. And since then, the terminology, at least within the ecosystem of data engineering in the organizational context, has been subsumed by the term lineage. And I'm wondering if you could maybe give your interpretation of some of the nuances between those two terms and where one maybe isn't quite a complete superset of the other.
[00:03:48] Paul Groth:
Yeah. I actually think when you're talking about data provenance and data lineage, you're pretty much talking about the same thing. Right? Although, technically, there's been an ongoing discussion, for example in research: are we talking about just things that are happening within your database? Right? So how do rows get transformed, and then we build views on top of that. And in the industry setting, and also in the broader context of data provenance, can we track across multiple systems? Right? So a lot of the original work in data provenance and data lineage was, on one side, really at the core database side of the world, tracking within the database, but then there were a lot of people looking at workflow systems and being able to track those workflow systems and how experimental results were produced.
And I think what you see now is a lot of focus on more of what I would call the workflow system style of provenance, so being able to trace across your organization. Right? So I think we've broadened out that term, and what we're really interested in is: hey, I got this result. How can I trace back across my whole data estate to where the input results are? We might use a data catalog for that. We might do some sort of structured logging for that. But I think that's really going beyond just tracking within your relational database, for example.
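As a rough illustration of the kind of structured logging Paul mentions, here is a minimal sketch of a provenance record emitted per transformation step. The field names, dataset identifiers, and system names are illustrative assumptions, not a specific standard such as W3C PROV or OpenLineage:

```python
import json
import uuid
from datetime import datetime, timezone

def provenance_record(inputs, outputs, activity, system):
    """Build a simple provenance/lineage event linking outputs to inputs.

    The fields here are illustrative; real deployments would typically follow
    a standard such as W3C PROV or OpenLineage and ship events to a catalog.
    """
    return {
        "event_id": str(uuid.uuid4()),
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "activity": activity,    # e.g. a scheduled job or notebook run
        "system": system,        # which system in the data estate did the work
        "inputs": inputs,        # upstream datasets (tables, files, APIs)
        "outputs": outputs,      # downstream datasets produced
    }

# Emit one event per transformation step; tracing a result back across the
# data estate then becomes a walk over these records.
event = provenance_record(
    inputs=["warehouse.raw.orders", "warehouse.raw.customers"],
    outputs=["warehouse.marts.revenue_by_region"],
    activity="daily revenue aggregation",
    system="orchestrator",
)
print(json.dumps(event, indent=2))
```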
[00:05:18] Tobias Macey:
And in terms of your academic focus, you mentioned that you're working with the intelligent data engineering lab. Obviously, a lot of the evolution of data engineering happens rather organically and, at least from my understanding and experience, has been very industry driven. And so I'm curious if you can give some of the details in terms of the areas of focus, some of the ways that data engineering can be conceived of as an academic pursuit, and some of the interesting insights that you're able to gain by virtue of your research that are maybe translatable into the day to day of engineers working in that organizational context.
[00:05:58] Paul Groth:
Yeah. So, actually, it's a pretty funny story about how I started INDELab. Well, not a funny story, but a story. I was an academic, I worked at the Free University of Amsterdam, and then I actually decided to go work at Elsevier, which is a large publisher and information and content company, and I was working in their research lab. At Elsevier, I had a really fantastic time, and it gave me really good insight into the practical day to day of managing large scale datasets. What are the problems that you see: information silos, the problems of semantics, getting to an agreement in an organization?
And that's one of the reasons, actually: after a couple of years there, I just kept seeing these kinds of fundamental problems that we continually have. Could we take a step back, go back to research, and think about some of those fundamental problems over the long horizon? In particular for me, these ideas of: how do we do data integration better? Do we have some primary mechanisms to work with messy data? Can I tell you rigorously, here's how you work with messy data, here's your data integration setup? Can I have a solution that I can say will work all the time? Can I have some rules about how to do that? So this is one of the reasons I went back into academia: that motivating factor of all these problems that we kept seeing within the corporate context.
I think, in general, there's a bigger conversation between academia and industry in the field of data engineering, or data management. Right? I mean, if you're working with a SQL database, that's coming from a lot of research, going back to the papers from Codd at IBM. But you also see this conversation between academics and industry happening a lot at places like SIGMOD. So I think sometimes you see the motivations coming from industry problems and then academics taking a step back and having a way to formalize those or come up with efficient ways to do them. I think maybe you've had this on the show, for example: the emergence of columnar databases would be an example of that kind of thing.
Specific to INDELab, what we're trying to do is work on a number of different thematic areas. One is: how do we automatically build databases, or automatically build knowledge graphs, from multiple sources? Right? So can I come to you and say, okay, given heterogeneous data, can I construct you a super high quality dataset that's super useful? Another area we are working on is what we call context aware data systems. So this is things like: how does your data management system cope with different users, different environments, changes in the dynamics around it? So you can think of data systems that are designed for digital twins. And lastly, I'm very compelled by data scientists and data workers.
And so how can we design data management systems for helping with machine learning? So, for example, how can we do deeper data quality assessments? Right? If you're building some data unit tests, can we tell you how that would work? Or, a recent paper we had was looking at the impact of data handling on machine learning models. So how do you change your datasets when you're doing data prep, and how does that impact your machine learning models? And it's kind of surprising: the impact is more than you would think it would be, and sometimes not in the way you would expect. Right? So your machine learning models can be very impacted by your data handling. Those are some of the areas we're working on, and I think they have messages for people in practice.
[00:10:13] Tobias Macey:
There are a lot of interesting things there to unpack, in particular that idea of context aware data systems. One of the things that immediately came to mind is the challenge of managing the nuance of access control in data warehouse environments, etcetera, where a lot of the suggested best practice is to use things like attribute based access controls. But then the challenge is: okay, well, how do you gain access to the appropriate attributes? How do you determine that mapping? And how does that propagate across a constantly shifting set of data models, where in the warehouse context you obviously want to make sure that you have consistent naming and consistent tagging, but there's typically not a broad guarantee that those are all in place in the way that you want them to be? And so there are numerous points at which you can accidentally leak access because you don't have the right attributes or because you haven't explicitly added the appropriate row level or column level permissions. And I'm wondering what your thoughts are, and whether there have been any areas of your research that touch on some of those complexities of identity and access control and managing the contextual elements of who should be able to do what and for what reasons.
[00:11:28] Paul Groth:
Yeah. So we haven't worked a lot on access control per se, but what we do see is exactly what you describe: this divergence in, I would call it, semantics. Right? So what do we call different attributes in different spaces? You see this all the time with things like the idea of customer versus person, or customer versus user. Is your user your customer? What's the role there? Are you a sysadmin, or are you an administrator? Are you a steward? What are all those called? And, essentially, the proliferation of data models is what it is. We have these classic cycles where everybody has their own data model, and we go to the data lake view of the world, and then we have different conversations about, oh, we need to data warehouse it so that we all agree on things, but then nobody ever fully does. So I think where the future is going is much more toward being able to adapt to any underlying data model for the application.
So, if we're talking about the security case, I can tell you that actually what you mean in this context is this person with this access control review. Right? And I think a lot of where we're gonna go in data management systems is this adaptability, and that's maybe something we'll talk about later: the use of AI to help you do that adaptation, because it is very hard, even in an organizational context, to enforce data models. I don't know if you've had that experience in your environment or seen it in other conversations you've had. But data model enforcement, especially across different environments, is hard. Right?
[00:13:11] Tobias Macey:
Yeah. One of the common refrains in that regard is the conversation around master data management, or ensuring that you do some of the entity disambiguation to make sure that the person you're talking about in this table is the same as the person you're talking about in that table, or the widget, or what have you. And so a lot of times that brings up the conversation around, oh, well, just turn it into a knowledge graph, with emphasis on "just" being a gross oversimplification of the effort involved.
And so, yeah, the adherence to specific domain models and specific semantics around those models is definitely, I think, one of the most long running and fraught challenges in data engineering as an exercise.
[00:14:00] Paul Groth:
Exactly. And one of the things I talk about when we talk about knowledge graphs is I always say these are great quality datasets. I don't know if that's actually true. But one of the things you like about them is, oh, everything has a unique ID, and we know exactly what we point to: the Data Engineering Podcast has a URL, it's uniquely identified, and we can describe it as a podcast. But if you actually look at any knowledge graph, you go out and look at Wikidata, there's still a lot of undisambiguated stuff. And even in knowledge graphs that are really rigorously maintained, you always have this proliferation of identity.
And in data management, we're just constantly trying to figure that out. So I think in some sense you have to have this conversation in your organization about what we are actually going to govern. Right? What are we gonna enforce, and what are we gonna let be willy nilly? And I think, in general, the question is: can we design some newer techniques that make it easier to do those kinds of mappings?
[00:15:04] Tobias Macey:
Digging a bit more into knowledge graphs from looking at some of your publications and some of your profile information available on the web, it shows that you've spent a significant portion of your career invested in that particular niche area. It's a fairly large niche, but in the broad context, it is something that has typically been constrained to a group of people who are very enthusiastic about it. And then there is a long tail of people who are interested but don't take the time to actually invest in it. And in the past couple of years in particular, there has been a huge growth and interest in knowledge graphs within the context of large language models, both for purposes of using the graph as a means of grounding and refining the context for the models, as well as the ability of those models to be used for actually constructing the graphs out of messy, unstructured data as a means of very rapidly bootstrapping a knowledge graph effort.
But then there's also still the long tail of cleaning and curating and pruning that graph. And so I'm wondering if you can talk to some of the ways that you're seeing the overall industry adoption maybe retread ground that is already well known, certain dead ends, etcetera, or maybe some of the ways that this renewed interest is reinvigorating that overall ecosystem of knowledge graphs as an area of both academic and industry pursuit.
[00:16:29] Paul Groth:
Yeah. So I think it's been pretty exciting, the introduction of LLMs just for the construction of knowledge graphs. Right? One of the biggest problems in building knowledge graphs is that what you were usually doing, say ten years ago, is taking relational databases and converting them into graphs. You were writing some sort of mapping language into some sort of graph structure, whether it be RDF or Cypher or whatever modeling language you wanted, but you were writing mapping rules. Or a big area of interest was using natural language processing to do information extraction, named entity resolution, and relation extraction, those kinds of things, but you were building pretty complicated pipelines to get that done. And what we've seen is that it's just so much easier now to construct those information extraction pipelines.
It's actually even easier to construct mappings. Right? So if you have underlying relational databases, it's pretty easy to get models to create mappings for you to a graph. And I think that's super exciting, because it makes that process of building a graph a lot easier, which is where I think a lot of people got hung up. Right? Because it was a big ask. When I was at Elsevier, we built knowledge graphs; I was one of the people building some of those first knowledge graphs, and we still have projects going on with them, and you see it's much, much easier for them to now construct those models and do the integration. And what that means, I think, is that we can start talking about what some of the benefits are from creating a graph. And I don't necessarily think it's having a graph database, necessarily.
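To make the "let the model draft the mapping" idea concrete, here is a minimal sketch under stated assumptions: call_llm is a placeholder for whatever model API you use, and the table schema, prompt, and rule format are invented for illustration rather than taken from the episode:

```python
# Sketch of LLM-assisted mapping generation from a relational schema to a graph.
# `call_llm` is a stand-in for a real model call; the canned return value keeps
# the example self-contained. A human still reviews the draft before it runs.

def call_llm(prompt: str) -> str:
    # In practice: send the prompt to your model provider and return its text.
    return (
        "(Customer {customers.customer_id}) -[PLACED]-> (Order {orders.order_id}), "
        "from orders.customer_id\n"
        "(Order {orders.order_id}) -[CONTAINS]-> (Product {orders.product_id}), "
        "from orders.product_id"
    )

TABLE_SCHEMA = """
orders(order_id, customer_id, product_id, ordered_at)
customers(customer_id, name, country)
"""

prompt = (
    "Given these relational tables:\n"
    f"{TABLE_SCHEMA}\n"
    "Propose a property-graph mapping, one rule per line, in the form:\n"
    "(subject_node) -[RELATION]-> (object_node), from <table.column>"
)

draft_mapping = call_llm(prompt)
for rule in draft_mapping.splitlines():
    print(rule)  # review and edit these rules before materializing any edges
```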
Although you just had a great talk with the folks from Kuzu DB, which I really enjoyed. But I think the recognition is that when you're building that graph, you start talking about defining those semantics that we were just talking about. Right? What do we mean by person? What do we mean by podcast? What do we mean by episode? What do we mean by customer? And writing those down and making those explicit. And, you know, I don't know if you know Juan Sequeda from data.world, now part of ServiceNow.
We've written things together, and he, in particular, has been on this journey of emphasizing the role that you wanna play in getting that agreement. Right? So that focus has been very strong. Another really interesting change with LLMs is that I think there was this idea maybe a couple of years ago that if I was gonna do a knowledge graph, everything had to be in the knowledge graph. Right? I had to convert all my data into a knowledge graph. That was never the case, but it was the central dogma that you kind of heard. And now what we've seen is this idea that, hey, you can put part of your data in a graph where it makes sense, and we can connect out. We can connect out with links to the underlying datasets. We can connect out with queries.
We can just leave data as literals, so actual blobs in the graph itself, and we just leave the data there. And that's actually okay, because we can take advantage of the fact that we have LLMs that are able to extract meaning from those literals without having to structure everything. And I think this ease of construction also points at things like: hey, actually, we can build a graph on the fly if that's useful for my downstream task. This is where you've seen things like Microsoft's GraphRAG, where what they do is construct a graph because it's helpful when we go out to prompt a language model. So I think those are some of the places where we see, okay, maybe we're in the trough, or wherever we are on those hype curves people talk about, but we're in a place where people are saying: this is where a knowledge graph is useful in my whole data engineering pipeline, in my whole data management system. It's not an all or nothing proposition.
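A small sketch of that "graph on the fly" pattern, in the spirit of (but not reproducing) systems like GraphRAG; the triples below are hard-coded stand-ins for what an LLM extraction step would produce:

```python
# Build a small graph from extracted triples, then serialize an entity's
# neighborhood as plain text to ground a prompt. Triples are illustrative.
import networkx as nx

triples = [
    ("Data Engineering Podcast", "has_episode", "Knowledge Graphs with Paul Groth"),
    ("Paul Groth", "works_at", "University of Amsterdam"),
    ("Paul Groth", "leads", "INDELab"),
]

G = nx.DiGraph()
for subj, rel, obj in triples:
    G.add_edge(subj, obj, relation=rel)

def neighborhood_context(entity: str) -> str:
    """Serialize an entity's outgoing edges as text to drop into a prompt."""
    lines = [
        f"{entity} {data['relation']} {obj}"
        for _, obj, data in G.out_edges(entity, data=True)
    ]
    return "\n".join(lines)

print(neighborhood_context("Paul Groth"))
```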
[00:20:51] Tobias Macey:
I think one of the elements of the knowledge graph ecosystem that maybe hampered its growth is the lack of unification around how to properly model and represent those graphs and then query them. There was the evolution from, I think, the earliest one being maybe OWL, and then SPARQL gained a lot of ground in the semantic web era, which a lot of people are still very interested in and devoted to. And then Neo4j came out and popularized the property graph model, which simplified the overall means of constructing and interacting with the graphs, but there was still no standardized means of actually querying them. So you had Cypher from Neo4j. You had Gremlin as sort of the open approach. Now we have GQL as the standard track syntax that is largely modeled after Cypher, but then you also have things such as Dgraph that leaned in on GraphQL as a query interface. And I'm wondering what your thoughts are on some of the ways that that lack of unification and lack of, I guess, settled best practice within that ecosystem has maybe hamstrung its adoption and growth, at least up until now.
[00:22:01] Paul Groth:
Yeah. So I think one of the banes of my existence is this conflation of, or the connection between, the principles of a knowledge graph and the underlying technology stacks you might need. I think you can build a knowledge graph in a relational database, no problem. Right? There are some principles there: okay, we're gonna have unique identifiers; we're gonna have relations, connections, links between them; we're gonna have types. Those are useful concepts to have, and we do that all the time. If you stand up at a whiteboard, you draw nodes and edges and you draw attributes. It's a useful thing to do. And now we can use different technology stacks to implement those. This is something that I always try to convey to people; maybe I need to do a better job or do more outreach. What I have seen, though, is that I'm very excited about things like Amazon Neptune, where, hey, if you wanna query in Cypher, or you wanna have a property graph view of the world, or you wanna have more of an RDF or semantic web or triple view of the world, you can have that over the same graph data that's there. Right? So there's almost an independence there. And that's where things like GQL are interesting, where you have your relational database, and then, hey, if you wanna treat that as a graph, here's how you do it: here are the nodes and here are the edges, technically, in that language.
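A minimal sketch of what that can look like in practice, using SQLite purely for illustration: stable identifiers, explicit types, and named relations, with traversal as an ordinary join. The table and identifier conventions here are assumptions, not a prescribed schema:

```python
# Knowledge-graph principles inside a plain relational database:
# unique node IDs, explicit types, and named edges between nodes.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE nodes (
        node_id TEXT PRIMARY KEY,   -- stable, unique identifier
        node_type TEXT NOT NULL,    -- e.g. Person, Podcast, Episode
        label TEXT
    );
    CREATE TABLE edges (
        src TEXT NOT NULL REFERENCES nodes(node_id),
        relation TEXT NOT NULL,     -- explicit, named relationship
        dst TEXT NOT NULL REFERENCES nodes(node_id)
    );
""")
con.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    ("person:paul-groth", "Person", "Paul Groth"),
    ("podcast:data-engineering", "Podcast", "Data Engineering Podcast"),
])
con.execute("INSERT INTO edges VALUES (?, ?, ?)",
            ("podcast:data-engineering", "features", "person:paul-groth"))

# Graph-style traversal is just a join over the edges table.
for row in con.execute("""
    SELECT n1.label, e.relation, n2.label
    FROM edges e
    JOIN nodes n1 ON n1.node_id = e.src
    JOIN nodes n2 ON n2.node_id = e.dst
"""):
    print(row)
```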
And so for me, that's the exciting point, because I think people are beginning to understand the underlying importance of the concepts, and maybe that's the result of all this proliferation of different technology stacks. That's okay. And, okay, now we're debating what's your favorite syntax, what's your fastest database, what is easiest to install, what can I get on my cloud? For me, maybe that's a good sign for the future, right, where you see vendors that are very much focused on developer experience looking to use these concepts, but maybe in a way that's better built into what you wanna use as a developer.
[00:24:15] Tobias Macey:
Moving back over to LLMs and the impact that they're having on the practice of interacting with data and the purpose of data in a lot of cases, I'm curious what you're seeing in your areas of research or some of the particularly interesting or novel applications of LLMs to that data engineering context or ways that you're seeing it shift both the academic and industrial practices around data.
[00:24:42] Paul Groth:
Alright. So I think there's a lot. One place, I think, is this idea of multimodal data becoming actually integrated into your database. Right? So, essentially, having database operators that are LLMs. And here I'd point to something like Immanuel Trummer's work on SwellDB, or there's an extension to DuckDB called FlockMTL. What these do is, essentially, in your database query, your SQL query, you can just write what looks like a user defined function that's calling out to the LLM. And what's cool about that is you can have essentially the same kind of declarative SQL style queries, but now you can do that over unstructured data. So multimodal data really becoming a first class citizen: text, images, directly in your database. And I think that might help us a lot, because oftentimes we've treated multimodal data, images, or videos as completely separate: you put that in your S3 bucket, you have a link out to it, you load it in. Now you can just store that in your database and do operations on it without necessarily pushing it to a separate pipeline. And I think that's a pretty interesting, exciting thing.
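This is not the actual SwellDB or FlockMTL API, just a hedged sketch of the general shape, assuming a recent DuckDB with Python UDF support: register a scalar function (here a stub standing in for an LLM call) and use it inside a declarative query over text:

```python
# Shape of the "LLM as a database operator" idea: a Python UDF registered in
# SQL. The `summarize` function is a stub; in practice it would call a model.
import duckdb

def summarize(text: str) -> str:
    # Stub: in practice this sends `text` to an LLM and returns its answer.
    return text[:40] + "..."

con = duckdb.connect()
con.create_function("summarize", summarize)  # types inferred from annotations

con.execute("""
    CREATE TABLE support_tickets AS
    SELECT * FROM (VALUES
        (1, 'Customer cannot log in after the password reset email expired.'),
        (2, 'Invoice totals differ between the dashboard and the CSV export.')
    ) AS t(ticket_id, body)
""")

print(con.execute(
    "SELECT ticket_id, summarize(body) AS gist FROM support_tickets"
).fetchall())
```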
Another thing, a bit more out there, is this idea of large language models as databases themselves. Right? So there's work by one of my colleagues in INDELab, Jan-Christoph Kalo, by Paolo Papotti at EURECOM, and also, again, by Immanuel Trummer, where, essentially, you treat the language model as a database. Everybody's familiar with this: language models know things. They have facts and data about the world. Do we necessarily need to build an actual database? Right? Can we just get our data directly from the language model? That actually solves some of the problems of data ingestion, because, essentially, your pretraining is almost your data ingestion, if you think about it that way. Now, there are lots of interesting ramifications of that. Right? Can you trust the facts coming back from your LLM?
Do you wanna treat it as that? Right? What's the trade off for keeping things in your LLM parameters versus actually using a traditional database, or even looking at unstructured content? So I think that's a very exciting research angle. And lastly, I don't think it's necessarily to do with LLMs per se, but it's the ramifications of training LLMs and the kinds of architectures that we're building. Essentially, all the major clouds, everybody, is investing massive amounts of resources in building data centers designed for ML systems, designed for training large language models.
And that means the underlying system architectures that we have are changing. And so the data management systems that we have are changing to take advantage of those underlying changes in computer architecture. Here I would point to Matteo Interlandi. He's at Microsoft Research, and he's done work on essentially building a database on top of GPUs by using PyTorch. So you have your SQL operators, your joins, your projections, your selections, and what they are is PyTorch functions that you then compile down, and you get these massive boosts in performance.
Why? Because you have all these GPUs. Right? People are buying loads of GPUs to put in these data centers. So I think this is something pretty exciting: hey, maybe we change the underlying architecture of our database to take advantage of all of the investment around AI hardware.
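Purely as an illustration of that idea (not Interlandi's actual implementation), here is a tiny sketch where a selection is a boolean mask and a grouped aggregation is a scatter-add, both of which run unchanged on a GPU when one is available:

```python
# Relational operators expressed as tensor operations.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# A tiny "sales" table stored as columns: region id and amount.
region = torch.tensor([0, 1, 0, 2, 1, 0], device=device)
amount = torch.tensor([10.0, 5.0, 7.0, 3.0, 8.0, 2.0], device=device)

# SELECT ... WHERE amount > 4   -> boolean mask
mask = amount > 4
region_f, amount_f = region[mask], amount[mask]

# SELECT region, SUM(amount) ... GROUP BY region   -> scatter-add
num_regions = 3
sums = torch.zeros(num_regions, device=device).scatter_add_(0, region_f, amount_f)
print(sums.tolist())  # per-region totals
```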
[00:28:37] Tobias Macey:
Yeah. That's definitely an interesting aspect as well, where we have had GPU powered databases, again as sort of a niche effort for specific use cases, but they've typically been very expensive because they need a GPU, which is not cheap to operate. And another area of computer architecture evolution is the idea of neuromorphic computing, where you're trying to change from just being a straight Von Neumann architecture to using interconnects and architectures that are more akin to the way that the human brain operates, as far as the highly parallelized, highly connected means of data interchange. Some of the, I guess, more pedestrian ways of thinking about it are some of the work that AMD is doing, where they're co-locating their CPU on the same chip as the GPU to allow for higher parallelism and data interchange between them. There's a lot of work going into the network stack to be able to do direct networking from the CPU to the GPU to cut out a lot of the bus transfer, etcetera. And I'm wondering, what are some of the conversations that you've had around how that maybe changes the ways that we even think about the role of a database in that context and the shape of what it can be or should be?
[00:29:51] Paul Groth:
Well, I think what it means is the conversations are: okay, we can put everything in memory, or we can drive a lot of things into memory. Right? I mean, if you're working with a MacBook, we all have massive amounts of memory that's available both to your GPU and to your CPU, and you can actually load your databases completely into memory, for the most part. So, in fact, this was the thinking behind the folks who live across the street at CWI in the creation of DuckDB: look, you can have columnar databases that look like SQLite, but because of the changes in computer architecture, we just have so much more RAM that we can actually deal with. And, as you said, there's this idea of, hey, we have so much memory bandwidth.
We can use these different kinds of chips, or different cores on your processor, to do different data management functions. And I think what that means, if you're a data management practitioner, is maybe you don't have to be in a cloud database right away. Right? So maybe that's the kind of message. Or maybe we can have pipelines where we can just auto construct a database right away, stand it up really quickly, do some things to it, shut it down, and maybe ship it across the network. So I think that's what I mean when I say we have to think about the architectures that are available, and how they change how we do this kind of data management practice.
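A small sketch of that "stand it up, use it, throw it away" pattern with an in-memory DuckDB instance querying a local file; the file name and contents are made up so the example is self-contained:

```python
# Ephemeral, local analytics: no server, no warehouse, nothing to tear down
# beyond the process's own memory.
import pathlib
import duckdb

pathlib.Path("orders_2024.csv").write_text(
    "product_id,quantity\nA,3\nB,7\nA,2\n"
)

con = duckdb.connect(":memory:")
result = con.execute("""
    SELECT product_id, SUM(quantity) AS units
    FROM read_csv_auto('orders_2024.csv')
    GROUP BY product_id
    ORDER BY units DESC
""").fetchall()
con.close()
print(result)  # e.g. [('B', 7), ('A', 5)]
```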
[00:31:31] Tobias Macey:
Absolutely. Yeah. It brings up one of the sort of tribal knowledge elements in the data industry, the idea of data gravity: oh, you just wanna ship your compute to your data because it's too expensive to move the data anywhere else. With the introduction of things like DuckDB and KuzuDB, these very lightweight, easy to construct and throw away data stores for doing high performance compute and querying, there has been a massive sea change in terms of how people think about their data architecture, where maybe they do still have a massive central repository of data, but not all of the computation happens across that repository.
And so there has been a massive growth in edge compute, or moving data to the edge for operation, because it reduces round trip latencies. That's, again, something that people have been doing for a while, at least in some small subset of cases, but it was generally a more specialized thing: oh, I need to optimize this specific thing, and so I'm doing this edge computing to deal with this one case. Now it's becoming a more generalized pattern because the tooling and capabilities have expanded to incorporate it. And the idea of data federation with these lakehouse architectures also means that the data doesn't even necessarily need to be situated in one location if you can massively parallelize the access of that underlying data, doing more of the push down of the query pruning to say, oh, I actually only need to select this subset of data; I don't need to scan absolutely every file for everything. A lot of those optimizations, I think, are definitely very interesting and are changing the ways that we manage our data architectures, especially as the scope of interaction with those datasets grows as well. Largely it was very centralized: leaving out the hyperscalers, you know, Facebook, Google, Amazon, etcetera, any given company would largely have a fairly geographically constrained audience, and so they didn't necessarily need to optimize for federation.
But as the availability of Internet access and of compute in multiple locations grows, and obviously with issues around data sovereignty, that changes the ways that we even need to be thinking about architecting our systems, for various reasons. And so I think that's another motivation for broadening the architectural principles of how we think about what data operations need to be done where and by whom.
[00:33:59] Paul Groth:
Exactly. Right. And, you know, I'm in Europe. So we think a lot about: okay, do you have data liability? Do I even want this dataset? Is it better to have it on the person's local device, while maybe I'm sourcing data from somewhere else? And, actually, given the power of edge devices and the ability for us to have things like Wasm, I can compile things for the browser, run them really fast, pull down necessary data from a central place, but leave the personal data with the person. So I think those are the kinds of architectures that we're gonna see more and more of, as you said. And I think it's kind of exciting, given the capacity of edge devices.
[00:34:40] Tobias Macey:
And then another aspect that we've touched on a little bit is this juxtaposition of data engineering in the context of an organization, where I'm focused on the needs of a particular business and the data assets that they need to be able to introspect their operations or provide services to their customers, and the data management approaches and requirements in research and academia, where you're largely focused on either creating or curating datasets, and a lot of the data management effort goes into how do I make sure that this data asset is reproducible for use by other people, either for derivative research or for being able to replicate my research. And I'm wondering if you can give some of your perspective on what you see as the various areas of overlap and disjoint practices across those two arenas.
[00:35:36] Paul Groth:
Yeah. I think the biggest thing is that we don't have any top downness in a research context. Right? In a business context, you may not think that you have top downness, but you have a little bit. We can come up with data models, maybe we have lots of them, but we can come up with data models that we agree on. Getting agreement on data models in research is very hard. People try to do it; there are standards bodies that try to do this in research, building biomedical ontologies, for example. But it's a much more difficult process, and a process that we often don't even do in research. So in some sense, one of the nice things in business is that you actually do have a customer. You either have an internal customer who you're trying to design a dataset for, or you have the end customer of the business, where you can think about: okay, how am I designing my data management system for serving that? Whereas in research data, we have these more broad ideas of: hey, I wanna put this dataset online, and hopefully somebody will reuse it. And there's a big conversation about how much you should invest in actually creating all the metadata so that somebody could potentially, in the future, pick up your data and reuse it. So one of the things I'm really excited about, coming back to our conversation about LLMs, is: can we use LLMs to help us create metadata on the fly in research, so that new users can come in, actually understand your data, and maybe use it for their domain? An interesting example: I was just talking to a researcher in construction materials.
And she was talking about: okay, I have a dataset that is really focused on the material properties of this particular kind of concrete. Actually, that dataset might be useful for another scientist looking at life cycle optimizations of construction. Now, those two scientists have different vocabularies. Can we use an LLM to bridge that gap for that scientist? Those kinds of conversations you have a little bit in business, but in research it's doubly so. The vocabularies are even farther apart. The semantics are even farther apart. And so bridging those is, I think, one of the biggest challenges that we have in doing research data management.
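As a sketch of that metadata-on-the-fly idea, using the concrete example above: ask a model to describe a dataset for a reader from a different field. The call_llm function, column names, and target vocabulary are illustrative assumptions, not from the episode:

```python
# Generate a cross-domain dataset description with an LLM (stubbed here).

def call_llm(prompt: str) -> str:
    # Stub: replace with a real model call. A canned answer keeps this runnable.
    return "Compressive-strength measurements for one concrete mix, usable as inputs to life-cycle assessment of construction materials."

columns = ["mix_id", "cement_kg_m3", "water_cement_ratio", "compressive_strength_mpa"]
sample_rows = [("M-014", 320, 0.45, 41.2), ("M-015", 300, 0.50, 37.8)]

prompt = (
    f"Columns: {columns}\n"
    f"Sample rows: {sample_rows}\n"
    "Write a short dataset description for a researcher in life-cycle "
    "assessment of construction, mapping these material-science terms to "
    "their vocabulary."
)
print(call_llm(prompt))
```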
[00:38:07] Tobias Macey:
That brings up an interesting conversation I had, I forget exactly with whom, but the overall gist of it was that the introduction of LLMs as a utility in the process of data management almost necessitates the inclusion of more external datasets within a business context, because you have the ability to do that. And by virtue of using external third party datasets, you are able to enrich the decision making that you're doing because you have a much broader context, a lot more information than just whatever first party data you're able to generate or collect. And I think there's an interesting parallel to, within research, being able to bridge across research domains because of the variance in the vocabularies used. And I think that is also applicable in that external dataset use case, where whoever created or curated that dataset, or whoever is selling that dataset, has a particular set of vocabularies.
They're probably targeting a particular set of industries, and so maybe that constrains who their focused target audience might be. But because LLMs can do some of that semantic mapping for you to translate into your specific business domain and vocabularies, it broadens the set of external assets that you might be able to incorporate and actually derive value from.
[00:39:25] Paul Groth:
Exactly. And I think one of the interesting knock on effects of exactly that comes back a little bit to the idea of large language models as a database. Right? So then you start to think: oh, do we need to actually manage our large language models, and our access to large language models, as actually a data asset? Are we just using it for its capabilities, or are we actually using the information in the LLM? And if I'm doing that, then maybe I need to start doing things like data versioning, having the right metadata about it, knowing if we have the right licenses, figuring out the data lineage that says not just that we used this component, but that we're actually sourcing information from that component. And I think this is something that, if you're in this space, you're gonna have to think about as a data management practitioner.
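One way to picture that is to catalogue the model the same way you would any other upstream data source. This is a minimal, illustrative record of the fields Paul mentions, not a standard schema:

```python
# Treat an LLM used as a source of facts like a governed data asset:
# pin the version, record the license, note what is known about its
# training data, and track which downstream datasets it feeds.
from dataclasses import dataclass, field, asdict

@dataclass
class ModelAssetRecord:
    model_id: str                 # e.g. provider name plus version pin
    license: str                  # usage terms for weights and outputs
    training_data_notes: str      # what is known about upstream sources
    used_as: str                  # "capability only" vs "source of facts"
    downstream_datasets: list[str] = field(default_factory=list)  # lineage

record = ModelAssetRecord(
    model_id="example-provider/llm-2025-01",       # hypothetical identifier
    license="commercial API terms, outputs owned by customer",
    training_data_notes="training corpus not disclosed by vendor",
    used_as="source of facts (entity descriptions)",
    downstream_datasets=["warehouse.enriched.company_profiles"],
)
print(asdict(record))
```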
[00:40:18] Tobias Macey:
Yeah. One of the interesting reductive summaries of large language models that I've come across is the idea that they are effectively a very sophisticated lossy compression algorithm. Yeah. And so, in your experience of working in this space of academic research on data management with this focus on machine learning use cases and the growing bidirectionality of that, what are some of the most interesting or innovative or unexpected ways that you have seen LLMs applied in that context of data management, as either a receiving end or a producer, or just some of the interesting areas of research that either you or some of your colleagues are focused on that you think are worth highlighting for the audience?
[00:41:06] Paul Groth:
I think, for example, having point cloud databases. Right? Actually, I was just recently at SIGMOD, and they had, essentially, the ability to use point clouds and do full structured queries across them, also using these kinds of LLM techniques. Or I saw one streaming database where they're taking data from a health care situation where you have sensor data coming in, plus the health care records of the person, and they were streaming that live and then using LLMs as operators to work over that dataset. So it really comes back to this idea of embedding LLMs into the operators of the database. I think that's been very cool for me.
[00:41:56] Tobias Macey:
And in your experience of working in this space and doing this academic research on data management and its intersection with these ML use cases, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:42:12] Paul Groth:
Yeah. So I think the number one thing, which I learned in industry but keep learning every day, is that real data is always surprising. Right? The particular nature of workloads, figuring that out. So I'll give you a couple of examples. Just recently, one of my PhD students, David Jackson, and doctor Hazar Hamuch, we built a knowledge graph, a database on bioactive compounds from the literature, called BaselDB. And it was just very interesting to see how ambiguous the data is around what constitutes bioactivity, and how we can take data from publications and translate it into a core database.
We've worked with another PhD student of mine, Trissel Libertore. She actually built what we call FashionDB, which is a knowledge graph of fashion data, actually using LLMs to extract things like the context of fashion, how it was used, the different seasons, and connecting that together, because we're using it for innovation studies in fashion. But just that real world data: every time we look at real data, it's always a mess, number one. And it's always super interesting, because people just don't do what you tell them to do in Database 101. So I think that's always surprising. It's also one of the reasons I really like working with companies and real organizations and building real datasets ourselves: you get that experience of what people are actually doing when they create datasets. The Excel spreadsheet problem just persists, and it persists at every scale. So that's always super challenging, but super fun.
[00:44:08] Tobias Macey:
And as you continue to invest in researching this constantly shifting ecosystem, what are some of the areas of focus that you are interested in for the near to medium term?
[00:44:21] Paul Groth:
Right. So one of the big things is what we were talking a little bit about: large language models as databases. How can we constrain the information coming out of those, and how can we make sure that it's factual, to more or less a degree? We have a publication that will come out at a conference on neurosymbolic AI later this year that's exactly about this: how can we make sure that the facts we get out of large language models are facts that we can use, or at least facts for which we can give some evidence for or against. So I'm excited about that.
We talked at the very beginning about this idea of flexible data integration. So the idea of: hey, I have a completely new data model. What does your underlying data estate look like? Can I automatically populate that new data model? Can we really drive down the cost of: yeah, I have a new view of the world, I have a new set of semantics, I have a new data model, can we auto populate that, and can we have confidence in that auto population? So automated generation of mappings, automated information extraction. That I'm super excited about. And lastly, coming back to the idea that people are surprising and real world data is surprising, I think our data engineering pipelines have always had some human in the loop. We've called them annotators, or we've called them the crowd, or we've called them experts back in the day. But this idea of data engineering pipelines as the combination of humans, AI, and technical systems together, I think, only becomes more important as we move up the stack with our data engineering pipelines. So those are some of the areas that, in the kind of medium term, we've been looking at and are pretty excited about.
[00:46:20] Tobias Macey:
Are there any other aspects of your areas of research focus or the overall impact of LLMs on the data engineering ecosystem or any of the other myriad topics that we touched on that we didn't discuss yet that you'd like to cover before we close out the show?
[00:46:36] Paul Groth:
No. I think that was a really great conversation. I really enjoyed it.
[00:46:41] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:46:56] Paul Groth:
I think the biggest gap right now is the fact that it's difficult to choose one technology stack. I mean, it's not tooling, but, something we didn't talk about: as a side project, I have a startup company called Longfork.ai. And one of the interesting things about that has been how we choose the right technology stack. I think this is really, really challenging. There's a lot of taste involved, and it keeps moving. This is one of the reasons why I enjoy your podcast so much, Tobias: it keeps me up to date on the changing nature of data management. But helping people really choose a technology stack that's right for them is exceedingly difficult. We talked about that with respect to knowledge graphs, but you see it in everything from workflows to which cloud service I should use. I think having some way to do that, maybe we'll never get there, but I think this is the biggest challenge: which technology stack should I use for my problem?
[00:48:01] Tobias Macey:
Yeah. It's definitely a constantly moving target. And for a certain period in the late nineties and early two thousands, you had very vertically integrated providers, where it was, oh, well, you just go and use Informatica or pick your provider. And then in the late twenty tens into the beginning of the twenty twenties, we had the growth of the, quote, unquote, modern data stack: oh, just pick whatever you want, and then you just cobble them all together; it'll be fine. And then everybody realized, actually, that's a huge amount of work and really painful to deal with. And so I think now we're starting to move back into another era of consolidation, where people are composing their own opinionated, vertically integrated stacks out of a grab bag of different technologies to say: we know that this is really painful; we're just going to do this part for you; just buy our product, and it'll be amazing, until you start to hit against the boundaries of it.
[00:49:02] Paul Groth:
So I have one last thing, if we have a little time. Yeah. And I don't know where you put this, but I have a request for your listeners. We teach a lot of bachelor students in data management. If they could tell us one thing, by sending me an email, that would be great: one thing that we should teach our students coming out of a bachelor's computer science curriculum. That I would love to know. This is always interesting, and I'd love to hear from any of your listeners who have thoughts or opinions on it. Absolutely. And, yeah, for anybody who does have opinions that they want to
[00:49:37] Tobias Macey:
send them along to Paul; his contact information is in the show notes. So, with that, I thank you for taking the time today to join me and share your thoughts and experiences on the areas of research that you're focused on, as well as pontificating on the overall ecosystem. It's been a very enjoyable conversation. I appreciate the time and energy that you and your group are putting into helping gain more insight into this constantly shifting space. So, thank you again, and I hope you have a good rest of your day. Yeah. Thanks a lot, Tobias. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Paul Groth and Knowledge Graphs
Understanding Data Provenance vs. Lineage
Academic Pursuits in Data Engineering
Context-Aware Data Systems
Challenges in Knowledge Graph Adoption
Standardization in Knowledge Graphs
Impact of LLMs on Data Engineering
Data Architecture Evolution and Edge Computing
LLMs as a Data Management Tool
Future Research Directions in Data Management