Summary
In this episode of the Data Engineering Podcast, Professor Paul Groth from the University of Amsterdam talks about his research on knowledge graphs and data engineering. Paul shares his background in AI and data management, discussing the evolution of data provenance and lineage, as well as the challenges of data integration. He explores the impact of large language models (LLMs) on data engineering, highlighting their potential to simplify knowledge graph construction and enhance data integration. The conversation covers the evolving landscape of data architectures, managing semantics and access control, and the interplay between industry and academia in advancing data engineering practices, with Paul also sharing insights into his work with the Intelligent Data Engineering Lab (INDELab) and the importance of human-AI collaboration in data engineering pipelines.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Paul Groth about his research on knowledge graphs and data engineering
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing the focus and scope of your academic efforts?
- Given your focus on data management for machine learning as part of the INDELab, what are some of the developing trends that practitioners should be aware of?
- ML architectures / systems changing (Matteo Interlandi); GPUs for data management
- You have spent a large portion of your career working with knowledge graphs, which have largely been a niche area until recently. What are some of the notable changes in the knowledge graph ecosystem that have resulted from the introduction of LLMs?
- What are some of the other ways that you are seeing LLMs change the methods of data engineering?
- There are numerous vague and anecdotal references to the power of LLMs to unlock value from unstructured data. What are some of the realities that you are seeing in your research?
- A majority of the conversations in this podcast are focused on data engineering in the context of a business organization. What are some of the ways that management of research data is disjoint from the methods and constraints that are present in business contexts?
- What are the most interesting, innovative, or unexpected ways that you have seen LLMs used in data management?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data engineering research?
- What do you have planned for the future of your research in the context of data engineering, knowledge graphs, and AI?
- Website
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- INDELab
- Data Provenance
- Elsevier
- SIGMOD 2025
- Digital Twin
- Knowledge Graph
- WikiData
- KuzuDB
- data.world
- GraphRAG
- SPARQL
- Semantic Web
- GQL == Graph Query Language
- Cypher
- Amazon Neptune
- RDF == Resource Description Framework
- SwellDB
- FlockMTL
- DuckDB
- Matteo Interlandi
- Paolo Papotti
- Neuromorphic Computing
- Point Clouds
- Longform.ai
- BASIL DB
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy? Datafold's Migration Agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multisystem migrations, they deliver production ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they turn months long migration nightmares into week long success stories. Your host is Tobias Macey, and today I'm interviewing Paul Groth about his research on knowledge graphs and data engineering. So, Paul, can you start by introducing yourself?
[00:01:06] Paul Groth:
Yeah. Thanks for having me. So I'm a professor at the University of Amsterdam where I lead a research group on intelligent data engineering. So this is really the intersection of how we use AI systems for data engineering and the other way around how we build better data management systems for AI.
[00:01:27] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:31] Paul Groth:
Yeah. So it's kind of interesting. When I was doing my bachelor's degree, I worked at an AI institute of all things. And then afterwards, I started my PhD, and I started my PhD in distributed computing. And I was working with use cases around high performance computing, and in particular, the provenance or data provenance of the results of high performance computing systems. And in particular, at the time, there was a thing called the grid, which is like a precursor to what we call the cloud now. And then the question was how do you track data provenance across these high performance computing systems. And so I started doing things like developing data models for data provenance.
And then, as my career went along, I started getting more and more into data systems. So, when I first moved to the Netherlands, I started building the first kind of graph databases for different things. So building a large biomedical knowledge graph, what we would call a biomedical knowledge graph now, called Open PHACTS, which was integrating, I think we had, like, 20 different databases that we were trying to integrate. So kind of my journey was, okay, we need to do data provenance. Then I got fascinated by, hey, we're integrating data from multiple sources, and then started building these kind of big data integration systems.
[00:03:06] Tobias Macey:
And provenance is an interesting one to dig into because I think I first came across that specific terminology in the context of data. I think it was more on the data science side of things, but that was, I wanna say, somewhere around the time frame of 2014 or 2015. And since then, the terminology, at least within the ecosystem of data engineering within the organizational context, has been subsumed by the term lineage. And I'm wondering if you could maybe give some of your interpretation around some of the variance of nuance between those two terms and where one maybe isn't quite a complete superset of the other.
[00:03:48] Paul Groth:
Yeah. I actually think pretty much when you're talking, data provenance and data lineage, you're talking about the same thing. Right? And I think, I think, technically, oftentimes, we think about data lineage and data provenance. There's been an ongoing discussion in, like, for example, research. Are we talking about just things that are happening within your database? Right? So how do rows get transformed, then we build views on top of that. And I think in the industry session and also in the broader context of data provenance, can we track across multiple systems? Right? So a lot of the original work in data provenance and data lineage was on one side really at this core database side of the world tracking within the database, but then there were a lot of people looking at workflow systems, and being able to track those workflow systems and how experimental results, were produced.
And I think what you see now is, you know, you have a lot of focus on more of what I would call the workflow system style provenance, so being able to trace across your organization. Right? So I think we've broadened out that term, and I think what we're really interested in is, hey, I got this result. How can I trace back across my whole data estate to where the input results are? And we might use a data catalog for that. We might do some sort of structured logging for that. But I think that's really, you know, going beyond just tracking within, you know, your relational database.
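As a purely illustrative sketch of the structured-logging approach Paul mentions, cross-system lineage can start as nothing more than one event per job run that names its inputs and outputs; the event schema and file path below are made up for the example, not any particular catalog's or standard's format.

```python
# Minimal sketch of structured lineage logging across systems (hypothetical schema).
import json
import time
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class LineageEvent:
    job: str              # the pipeline step or process that ran
    inputs: List[str]     # upstream datasets (table names, URIs, file paths)
    outputs: List[str]    # downstream datasets produced
    started_at: float = field(default_factory=time.time)

def emit(event: LineageEvent, log_path: str = "lineage.jsonl") -> None:
    # Append one JSON line per run; a catalog or search index can ingest this later.
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(event)) + "\n")

emit(LineageEvent(
    job="nightly_orders_rollup",
    inputs=["warehouse.raw_orders", "s3://bucket/currency_rates.csv"],
    outputs=["warehouse.orders_daily"],
))
```

Tracing a result back is then a matter of walking these events from outputs to inputs, which is the whole-data-estate view Paul contrasts with row-level provenance inside a single database.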
[00:05:18] Tobias Macey:
And in terms of your academic focus, you mentioned that you're working with the intelligent data engineering lab. Obviously, a lot of evolution of data engineering happens rather organically in a large part, at least from my understanding and experience, has been very industry driven. And so I'm curious if you can give some of the details in terms of the areas of focus and some of the ways that data engineering can be conceived of as an academic pursuit and some of the interesting insights that you're able to gain by virtue of your research that are maybe translatable into some of the day to day of engineers working in that organizational context?
[00:05:58] Paul Groth:
Yeah. So, actually, it's a pretty funny story about how I, well, not a funny story, but a story about how I started INDELab. So I was an academic. I worked at the Free University Amsterdam, and then I actually decided to go work at Elsevier, which is a large publisher and information content company, and I was working in their research lab. And at Elsevier, I had a really fantastic time, and it gave me really good insights into the practical day to day of managing large scale datasets. What are the problems that you see: information silos, the problems of semantics, getting to an agreement in an organization?
And that's one of the reasons, actually: after a couple years there, I just kept seeing these kind of fundamental problems that we continually have. And could we take a step back and go back and think about research and think about some of those fundamental problems over the long horizon? And in particular for me, these ideas of how do we do data integration better? How do we work? Do we have some, like, primary mechanisms to work with messy data? Can I tell you rigorously, here's how you work with messy data? Here's your data integration setup. Can I do that? Can I have a solution that I can say will work all the time? Can I have some rules about how to do that? So this is one of the reasons I kind of went back into academia, because of that motivating factor of all these problems that we kept seeing within the corporate context.
I think, in general, actually, there's a bigger conversation between, you know, academia and industry in the field of data engineering or data management. Right? So, I mean, I think if you're working with a SQL database, that's coming from a lot of research, you know, going back to the papers from Codd at IBM. But, also, you see this conversation between academics and industry happening a lot at places like SIGMOD. So I think sometimes you see the motivations coming from industry problems, and then, you know, academics taking a step back and having a way to formalize those or come up with efficient ways to do that. Like, I think maybe you've had on this show, for example, the emergence of columnar databases would be an example of that kind of thing.
Specific to INDELab, right, what we're trying to do is work on a number of different thematic areas. One is how do we automatically build databases or automatically build knowledge graphs from multiple sources. Right? So can I come to you and say, okay, given heterogeneous data, can I construct you a super high quality dataset that's super useful? Another place that we are working on is what we call context aware data systems. So this is things like, how does your data management system cope with different users, different environments, changes in the dynamics around it? So you can think of data systems that are designed for digital twins. And lastly, like, I'm very compelled by data scientists and data workers.
And so how can we design data management systems for helping with machine learning? So, for example, how can we do deeper data quality assessments? Right? So if you're building some data unit tests, can we tell you, you know, how that would work? Or, like, a recent paper we had was looking at the impact of data handling on machine learning models. So how do you change your datasets when you're doing data prep? How does that impact your machine learning models? And it's kind of surprising: the impact is bigger than you would think. Right? So your machine learning models can be very impacted by your data handling, and sometimes not in the ways you would expect. Those are some of the areas we're working on, and I think they have messages for people in practice.
[00:10:13] Tobias Macey:
There are a lot of interesting things there to unpack, in particular that context aware data systems piece. One of the things that immediately came to mind is the challenge of managing the nuance of access control in data warehouse environments, etcetera, where a lot of the suggested best practice is to use things like attribute based access controls. But then the challenge is, okay, well, how do you gain access to the appropriate attributes? How do you determine that mapping? And how does that propagate across a constantly shifting set of data models, where in the warehouse context, obviously, you want to make sure that you have consistent naming and consistent tagging, but there's typically not a broad guarantee that those are all in place in the way that you want them to be? And so there are numerous points at which you can accidentally leak access because you don't have the right attributes or because you haven't explicitly added the appropriate row level permissions or column level permissions. And I'm wondering what your thoughts are, and whether there have been any areas of your research that touch on some of those complexities of identity and access control and managing the contextual elements of who should be able to do what and for what reasons.
[00:11:28] Paul Groth:
Yeah. So we haven't worked a lot on access control per se, but what we do see is exactly what you see: this divergence in what I would call semantics. Right? So what do we call different attributes in different spaces? You see this all the time with things like the idea of customer versus person, right, or customer versus user. Right? Is your user your customer? What's the role there? Or are you a sysadmin, or are you an administrator? Are you a steward? What are all those called? And, essentially, the proliferation of data models is what it is. And we have these kind of classic things where everybody has their own data model, and we go to the data lakes view of the world. And then we have different conversations about, oh, we need to data warehouse it so that we all agree on things, but then nobody ever really does. So I think this is where the future is going: much more, I can adapt to any underlying data model for the application.
So I can tell you, if we're talking about the security case, actually, what you mean in this context is this person with this access control view. Right? And I think this is where a lot of where we're gonna go in data management systems is this adaptability, and that's maybe something we'll talk about later, like the use of AI to help you do that adaptation, because it is very hard, even in an organizational context, to enforce data models. I don't know if you've had that experience in your environment or you've seen that in other conversations you've had. But data model enforcement, especially across different environments, is hard. Right?
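To ground the attribute-based access control concern Tobias raises, here is a minimal, hypothetical sketch of a default-deny check: a column is only released when the caller's attributes satisfy the tag attached to it, so an untagged or mis-tagged column fails closed rather than leaking. The policy keys and attribute names are invented for illustration, not drawn from any real system.

```python
# Illustrative attribute-based access control check (default deny).
from typing import Dict, Set

# Hypothetical column-level tags: which attribute grants access to which column.
COLUMN_POLICY: Dict[str, Set[str]] = {
    "customer.email": {"pii_reader"},
    "customer.country": {"analyst", "pii_reader"},
    # "customer.ssn" is intentionally untagged -> nobody gets it by default.
}

def allowed_columns(user_attrs: Set[str], requested: Set[str]) -> Set[str]:
    granted = set()
    for col in requested:
        required = COLUMN_POLICY.get(col)
        # Fail closed: a column with no policy entry is never released.
        if required and required & user_attrs:
            granted.add(col)
    return granted

print(allowed_columns({"analyst"}, {"customer.email", "customer.country", "customer.ssn"}))
# -> {'customer.country'}
```

The default-deny stance is the design choice that matters here: a column that never got tagged during a schema change is withheld instead of silently exposed.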
[00:13:11] Tobias Macey:
Yeah. One of the common refrains in that regard is the conversation around master data management, or ensuring that you do some of the entity disambiguation to make sure that the person that you're talking about in this table is the same as the person that you're talking about in that table, or the widget, or what have you. And so then a lot of times that brings up the conversation around, oh, well, just turn it into a knowledge graph, with emphasis on "just" being a gross oversimplification of the effort involved.
And so, yeah, the the adherence to specific domain models and specific semantics around those models is definitely, I think, one of the most long running and fraught challenges in data engineering as an exercise.
[00:14:00] Paul Groth:
Exactly. And I think, like, one of the things I talk about when we talk about knowledge graphs is I always say these are great quality datasets. I don't know if that's actually true. But, like, one of the things you like about them is, oh, everything has a unique ID, and, you know, we know exactly what we point to: the Data Engineering Podcast. It has the URL. It's uniquely identified, and we can describe it as a podcast. But if you actually look at any knowledge graph, you go out and look at Wikidata, there's still a lot of undisambiguated things. And even in knowledge graphs that are, you know, really rigorously maintained, you always have this proliferation of identity.
And in data management, we're just constantly trying to figure that out. So I think in some sense, you have to have this conversation in your organization about what are we gonna actually govern. Right? What are we gonna enforce, and what are we gonna let be willy nilly? And I think, in general, the question is, can we design some newer techniques that make it easier to do those kinds of mappings?
[00:15:04] Tobias Macey:
Digging a bit more into knowledge graphs from looking at some of your publications and some of your profile information available on the web, it shows that you've spent a significant portion of your career invested in that particular niche area. It's a fairly large niche, but in the broad context, it is something that has typically been constrained to a group of people who are very enthusiastic about it. And then there is a long tail of people who are interested but don't take the time to actually invest in it. And in the past couple of years in particular, there has been a huge growth and interest in knowledge graphs within the context of large language models, both for purposes of using the graph as a means of grounding and refining the context for the models, as well as the ability of those models to be used for actually constructing the graphs out of messy, unstructured data as a means of very rapidly bootstrapping a knowledge graph effort.
But then there's also still the long tail of cleaning and curating and pruning that graph. And so I'm wondering if you can talk to some of the ways that you're seeing the overall industry adoption, maybe retread ground that is already well known of certain dead ends, etcetera, or maybe some of the ways that this renewed interest is reinvigorating that overall ecosystem of knowledge graphs as an area of both academic and industry pursuit?
[00:16:29] Paul Groth:
Yeah. So I think, like, you know, it's been pretty exciting, the introduction of LLMs just for the construction of knowledge graphs. Right? So one of the biggest problems in building knowledge graphs, usually what you were doing, you know, ten years ago, is you were taking relational databases and you were converting them into graphs. And you were writing some sort of mapping language into some sort of graph structure, whether it be RDF or Cypher or whatever modeling language you wanted, but you were writing mapping rules. Or a big area of interest was using natural language processing to do information extraction, named entity resolution, and relation extraction, those kinds of things, but you were building pretty complicated pipelines to get that done. And what we've seen is really that it's just so much easier to construct those information extraction pipelines.
It's actually even easier to construct mappings. Right? So if you have underlying relational databases, it's pretty easy to get models to create mappings for you to a graph. And I think that's super exciting. Right? It makes that process of building a graph a lot easier, where I think a lot of people got kind of hung up on it before. Right? Because it was a big ask. Right? So when I was at Elsevier, we built knowledge graphs. I was, you know, one of the people who was building some of those first knowledge graphs. And we still have projects going on with them, and you see it's much, much easier for them to now construct those models and do the integration. And what that means, I think, is that we can start talking about, you know, what are some of those benefits from creating a graph. And I don't necessarily think it's having a graph database, necessarily.
Although you just had a great talk with the guys from KuzuDB, which I really enjoyed. But, right, I think the recognition is, when you're building that graph, you start talking about defining those semantics that we were just talking about. Right? What do we mean by person? What do we mean by podcast? What do we mean by episode? What do we mean by customer? And writing those down and making those explicit. And, you know, I don't know if you know Juan Sequeda from data.world / ServiceNow.
And, you know, we've written things together, and he, in particular, has been on this journey of kind of emphasizing the role that you wanna play in getting that agreement. Right? So that focus there. So I think that's been very strong. Another really interesting change with LLMs is that we don't have to put everything in the graph. I think there was this idea maybe a couple years ago that if I was gonna do a knowledge graph, everything had to be in the knowledge graph. Right? I had to convert all my data into a knowledge graph. And that was never the case, but this was, like, the central dogma that you kind of heard. And now what we've seen is this idea that, hey, you could put part of your data in a graph where it makes sense, and we could connect out. We could connect out with links to the underlying datasets. We can connect out with queries.
We can just leave data as literals, so actually blobs in the graph itself, and we just leave the data there. And that's actually okay. And we can take advantage of the fact that we have LLMs that are able to extract meaning from those literals without having to construct everything. And I think this ease of construction also, you know, leads to things like, hey, actually, we can build a graph on the fly if that's useful for my downstream task. And this is where you've seen things like Microsoft's GraphRAG. Right? So what they do is they construct a graph because it's helpful when we go out to prompt a language model. So I think those are some of the, you know, the places where we see, okay, maybe we're in the trough of, I don't know, whatever these troughs are that people talk about, but, like, we're in a place where people are saying, okay, this is where a knowledge graph is useful in my whole data engineering pipeline, in my whole data management system. It's not an all or nothing proposition.
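As a sketch of the "LLMs make construction easier" point, the usual pattern is: hand the model a chunk of text (or a table schema), ask for triples or mapping rules in a structured format, then validate before loading anything into the graph. The call_llm stub below stands in for whatever model API you use; the prompt and output format are assumptions for illustration, not a specific tool's interface.

```python
# Illustrative LLM-assisted triple extraction; call_llm is a placeholder for your model API.
import json
from typing import List, Tuple

def call_llm(prompt: str) -> str:
    """Stub: replace with a real call to your LLM provider or local model."""
    raise NotImplementedError

def extract_triples(text: str) -> List[Tuple[str, str, str]]:
    prompt = (
        "Extract (subject, predicate, object) triples from the text below. "
        "Return only a JSON list of 3-element lists.\n\n" + text
    )
    raw = call_llm(prompt)
    triples = json.loads(raw)  # verify the model actually returned JSON before loading
    return [tuple(t) for t in triples if len(t) == 3]

# Usage idea: extract_triples("The Data Engineering Podcast is hosted by Tobias Macey.")
# might yield [("Data Engineering Podcast", "hostedBy", "Tobias Macey")] -- results still
# need the cleaning and curation pass discussed above before they enter the graph.
```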
[00:20:51] Tobias Macey:
I think one of the elements of the knowledge graph ecosystem that maybe hampered its growth is the lack of unification around how to properly model and represent those graphs and then query them, where there was the evolution from, I think the earliest one was maybe OWL, and then SPARQL gained a lot of ground in the semantic web era, which a lot of people are still very interested in and devoted to. And then Neo4j came out and popularized the property graph model, which simplified the overall means of constructing and interacting with the graphs, but there was still no standardized means of actually querying them. So you had Cypher from Neo4j. You had Gremlin as sort of the open approach. Now we have GQL as the standard track syntax that is largely modeled after Cypher, but then you also have things such as Dgraph that leaned in on GraphQL as a query interface. And I'm wondering what your thoughts are on some of the ways that that lack of unification and lack of, I guess, settled best practice within that ecosystem has maybe hamstrung its adoption and growth, at least up until now.
[00:22:01] Paul Groth:
Yeah. So I think one of the banes of my existence is this kind of coupling between the principles of a knowledge graph and the underlying technology stacks you might need. So I think you can build a knowledge graph in a relational database. No problem. Right? There are some principles there. Okay, we're gonna have unique identifiers. We're gonna have relations, connections, links between them. We're gonna have types. Right? Those are useful concepts to have, and we do that all the time. If you stood up at a whiteboard, you'd draw nodes and edges and you'd draw attributes. It's kind of a useful thing to do. And now we can use different technology stacks to implement those. And this is something that I always try to convey to people. Maybe I need to do a better job or need to do more outreach. What I have seen, though, is, like, I'm very excited about things like Amazon Neptune, where what you see is, hey, if you wanna query in Cypher or you wanna have a property graph view of the world, or if you wanna have more of an RDF or semantic web or triple view of the world, hey, you can have that over the same sort of graph data that's there. Right? So almost this independence from that. And that's where things like, you know, GQL are interesting, where you have your relational database, and then, hey, if you wanna treat that as a graph, here's how you do it. Here are the nodes and here are the edges, technically, in that language.
And so for me, that's kind of the exciting point, because I think people are beginning, and maybe this is the result of all this proliferation, right, of different technology stacks, to understand the underlying importance of the concepts. That's okay. And, okay, now we're debating what's your favorite syntax, what's your fastest database, what is easiest to install, what can I get on my cloud? Those, for me, are maybe a good sign for the future, right, where you see vendors that are very much around developer experience looking to use these concepts, but maybe in a way that's better built into what you wanna use as a developer.
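Paul's point that the principles (unique identifiers, typed nodes, explicit relations) matter more than the stack can be made concrete with nothing but SQLite; the two-table layout below is a minimal sketch invented for illustration, not any particular product's schema.

```python
# Knowledge-graph principles on a plain relational engine: unique IDs, types, edges.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nodes (id TEXT PRIMARY KEY, type TEXT, label TEXT);
CREATE TABLE edges (src TEXT, predicate TEXT, dst TEXT,
                    FOREIGN KEY (src) REFERENCES nodes(id),
                    FOREIGN KEY (dst) REFERENCES nodes(id));
""")
con.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    ("podcast:dep", "Podcast", "Data Engineering Podcast"),
    ("person:tobias", "Person", "Tobias Macey"),
])
con.execute("INSERT INTO edges VALUES (?, ?, ?)", ("podcast:dep", "hostedBy", "person:tobias"))

# One-hop traversal ("who hosts this podcast?") is just a join.
for row in con.execute("""
    SELECT n.label FROM edges e JOIN nodes n ON n.id = e.dst
    WHERE e.src = 'podcast:dep' AND e.predicate = 'hostedBy'
"""):
    print(row[0])
```

The same nodes-and-edges shape could be loaded into a property graph store or exposed as triples later; the identifiers and types are what carry over.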
[00:24:15] Tobias Macey:
Moving back over to LLMs and the impact that they're having on the practice of interacting with data and the purpose of data in a lot of cases, I'm curious what you're seeing in your areas of research or some of the particularly interesting or novel applications of LLMs to that data engineering context or ways that you're seeing it shift both the academic and industrial practices around data.
[00:24:42] Paul Groth:
Alright. So I think there's a lot. One place, I think, is this idea of multimodal data becoming actually integrated into your database. Right? So, essentially, having database operators that are LLMs. And here, I'd point to something like Immanuel Trummer's work on SwellDB, or there's an extension to DuckDB called FlockMTL. And what these do is, essentially, in your database query, you're writing your SQL query, and you can just write what looks like a user defined function that's calling out to the LLM. And what's cool about that is you can have essentially the kind of declarative SQL style queries, but now you can do that over unstructured data. So kind of multimodal data really becoming a first class citizen, text, images directly in your database. And I think that might help us a lot, because oftentimes we've treated, you know, multimodal data, images, or videos as completely separate: you put that in your S3 bucket, you have a link out to that, you load it in. And now you can just store that in your database, and you can do operations on it without necessarily pushing that to, you know, a separate pipeline. And I think that's a pretty interesting, exciting thing.
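The actual SwellDB and FlockMTL interfaces differ from this, but the shape of the idea — a declarative query whose projection calls out to a model — can be sketched with DuckDB's Python UDF registration and a stubbed model call; everything here (function name, table, fake "summary") is an assumption for illustration.

```python
# Sketch: an LLM as a scalar operator inside a SQL query (DuckDB Python UDF).
# The model call is stubbed; swap in your provider. Not FlockMTL's or SwellDB's actual API.
import duckdb
from duckdb.typing import VARCHAR

def llm_summarize(text: str) -> str:
    # Placeholder for a real LLM call; here we just fake a "summary".
    return text[:40] + "..."

con = duckdb.connect()
con.create_function("llm_summarize", llm_summarize, [VARCHAR], VARCHAR)

con.execute("CREATE TABLE reviews (id INTEGER, body VARCHAR)")
con.execute("INSERT INTO reviews VALUES (1, 'The onboarding flow was confusing but support resolved it quickly.')")

# Unstructured text handled declaratively, alongside ordinary relational columns.
print(con.execute("SELECT id, llm_summarize(body) AS gist FROM reviews").fetchall())
```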
Another thing, I think more out there, is this idea of large language models as databases themselves. Right? So there's work by one of my colleagues in INDELab, Jan-Christoph Kalo, and Paolo Papotti over at EURECOM, and also, again, Immanuel Trummer, where, essentially, you treat the language model as a database. And everybody's familiar with this: language models know things. They have facts and data about the world. Do we necessarily need to build an actual database? Right? Can we just get our data directly from the language model? Right? And that actually solves some of the problems of data ingestion. Right? Because, essentially, your pretraining is almost your data ingestion, if you think about it that way. Now there's lots of interesting ramifications of that. Right? Can you trust the facts coming back from your LLM?
Do you wanna treat it as that? Right? What's the trade off for, you know, keeping things in your LLM parameters versus actually using a traditional database or even looking at unstructured content? Right? So I think that's kind of a very exciting research angle. And lastly, I don't think it's necessarily to do with LLMs per se, but it's the ramifications of training LLMs and the kinds of architectures that we're building. So, essentially, all the major clouds, everybody's investing massive amounts of resources in building data centers designed for ML systems, right, designed for training large language models.
And that means the underlying systems that we have, the system architectures that we have, are changing. And so the data management systems that we have are changing to take advantage of those underlying changes in computer architecture. So here I would point to Matteo Interlandi. He's at Microsoft Research, and he's done work on essentially building a database on top of GPUs by using PyTorch. So you have your SQL operators, your joins, your projections, your selections, and what they are is PyTorch functions that you then compile down, and you get these massive boosts in performance.
Why? Because you have all these GPUs. Right? People are buying loads of GPUs to put in these data centers. So I think this is something pretty exciting in terms of, hey, maybe we change the underlying architecture of our database to take advantage of all of the investment around AI hardware.
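The toy sketch below only shows the flavor of the relational-operators-as-tensor-programs idea Paul describes — a selection as a boolean mask and a group-by sum as a weighted bincount — and is not the API of any real system; moving the tensors to a GPU is the one-line change noted in the comment.

```python
# Toy version of relational operators as tensor ops (not any real system's API).
import torch

# A tiny "orders" table stored column-wise as tensors.
customer_id = torch.tensor([0, 1, 0, 2, 1])
amount      = torch.tensor([10.0, 25.0, 7.5, 40.0, 5.0])

# SELECT ... WHERE amount > 8   -> a boolean mask (selection).
mask = amount > 8.0
filtered_ids, filtered_amounts = customer_id[mask], amount[mask]

# SELECT customer_id, SUM(amount) GROUP BY customer_id -> weighted bincount (aggregation).
totals = torch.bincount(filtered_ids, weights=filtered_amounts)
print(totals)  # tensor([10., 25., 40.])

# On a machine with a GPU, the same plan runs there by moving the tensors:
# customer_id, amount = customer_id.cuda(), amount.cuda()
```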
[00:28:37] Tobias Macey:
Yeah. That's definitely an interesting aspect as well, where we have had GPU powered databases, again, as sort of a niche effort for specific use cases, but they've typically been very expensive because they need a GPU, which is not cheap to operate. And another area of computer architecture evolution is the idea of neuromorphic computing, where you're trying to change from just being a straight Von Neumann architecture to using interconnects and architectures that are more akin to the way that the human brain operates, as far as the highly parallelized, highly connected means of data interchange. Some of the, I guess, more pedestrian ways of thinking about it are some of the work that AMD is doing, where they're collocating their CPU on the same chip as the GPU to allow for higher parallelism and data interchange between them. There's a lot of work going into the network stack to be able to do direct networking from the CPU to the GPU to cut out a lot of the bus transfer, etcetera. And I'm wondering, what are some of the conversations that you've had around how that maybe changes the ways that we even think about the role of a database in that context and the shape of what it can be or should be?
[00:29:51] Paul Groth:
Well, I think what it means is, like, I mean, the conversations are, okay, we can put everything in, well, we can put everything in memory, or we can drive a lot of things into memory. Right? I mean, if you're working with a MacBook, right, we all have massive amounts of memory that's available both to your GPU and to your CPU, and you can actually load your databases completely in memory, right, for the most part. Right? So, in fact, this was the thought behind, you know, the folks that live across the street at CWI in the creation of DuckDB: look, actually, you can have columnar databases that look like SQLite, but because of the changes in computer architecture, we just have so much more RAM that we can actually deal with. And as you said, this kind of idea of, hey, we have so much memory bandwidth.
We can use these different kinds of chips, or different cores on your processor, to do different data management functions. And I think what that means for us, you know, if you're a data management practitioner, is maybe you don't have to be in a cloud database right away. Right? So maybe that's the kind of message. Right? Or maybe we can have pipelines where we can just autoconstruct a database right away, put it up really quickly, do some things to it, shut it down, and maybe shift it across the network. Right? So I think that's what I mean when we have to think about the architectures that are available. Right? So that changes how we do this kind of data management practice.
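The "spin a database up right away, do some things to it, shut it down" pattern is roughly the following; the Parquet file name is a placeholder for whatever local file you happen to have, and everything runs in the process rather than against a server.

```python
# In-process, in-memory analytics with DuckDB: no server, nothing to provision.
import duckdb

con = duckdb.connect()  # purely in-memory; pass a filename instead to persist

# Query a local file directly (placeholder path), aggregate, and throw the engine away.
con.execute("""
    CREATE TABLE events AS
    SELECT * FROM read_parquet('events.parquet')  -- hypothetical file
""")
print(con.execute("SELECT count(*) FROM events").fetchone())

con.close()  # nothing left running; the result (or a copied file) can move elsewhere
```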
[00:31:31] Tobias Macey:
Absolutely. Yeah. It brings up one of the sort of tribal knowledge elements in the data industry, the idea of data gravity, where, oh, you just wanna ship your compute to your data because it's too expensive to move the data anywhere else. With the introduction of things like DuckDB and KuzuDB, these very lightweight, easy to construct and throw away data stores for being able to do high performance compute and querying, there has been a massive sea change in terms of how people think about their data architecture, where maybe they do still have a massive central repository of data, but not all of the computation happens across that repository.
And so there has been a massive growth in edge compute, or moving data to the edge for operation, because then it reduces round trip latencies. And that's, again, something that people have been doing for a while, at least in some small subset of cases, but it was generally a more specialized thing of, oh, I need to optimize this specific thing, and so I'm doing this edge computing to deal with this one case. But now it's becoming a more generalized pattern because the tooling and capabilities have expanded to incorporate that. And the idea of data federation with these lakehouse architectures as well means that the data doesn't even necessarily need to be situated in one location if you can massively parallelize the access of that underlying data, being able to do more of the push down of the query pruning to say, oh, I actually only need to select this subset of data. I don't need to scan absolutely every file for everything. A lot of those optimizations, I think, are definitely very interesting and changing the ways that we manage our data architectures, especially as the scope of interaction with those datasets grows as well, where largely it was very centralized. Leaving out the sort of hyperscalers of, you know, Facebook, Google, Amazon, etcetera, any given company would largely have a fairly geographically constrained audience, and so they didn't necessarily need to optimize for federation.
But as the availability of Internet, the availability of compute in multiple locations, obviously, with issues around data sovereignty, that changes the ways that we even need to be thinking about architecting our systems for various reasons. And so I think that's another motivation for broadening the architectural principles of how we think about what data operations need to be done where and by whom.
[00:33:59] Paul Groth:
Exactly. Right. And, you know, I'm in Europe. So, right, we think a lot about, okay, do you have data liability? Do I even want this dataset? Right? Is it better to have it on the person's local device, and maybe I'm sourcing data from somewhere else? And, actually, given the power of, you know, edge devices and the ability for us to have things like running Wasm, right, I can compile things to run in a browser, run them really fast. I can pull down necessary data from a central place, but I can leave personal data with the person. So I think those are the kinds of architectures that we're gonna see more and more of, as you said. And I think it's kind of exciting given the capacity of the edge devices.
[00:34:40] Tobias Macey:
And then another aspect that we've touched on a little bit is this juxtaposition of data engineering in the context of an organization, where I'm focused on the needs of a particular business and the data assets that they need to be able to introspect their operations or provide services to their customers, and then the data management approaches and requirements in research and academia, where you're largely focused on either creating or curating datasets, and then a lot of the data management effort goes into how do I make sure that this data asset is reproducible for use by other people, either for derivative research or for being able to replicate my research. And I'm wondering if you can give some of your perspective on what you see as the various areas of overlap and disjoint practices across those two arenas.
[00:35:36] Paul Groth:
Yeah. I think the biggest thing is that we don't have any top downness in research contexts. Right? So in a business context, you may not think that you have top downness, but you have a little bit. We can come up with data models. Maybe we have lots of them, but we can come up with data models that we agree on. Right? Getting agreement on data models in research is very hard. People try to do it. Right? There are standards bodies that try to do this in research, building biomedical ontologies, for example. But it's a much more difficult process, and a process that we don't often even do in research. Right? So in some sense, one of the nice things in business is you do have a customer. Right? You either have an internal customer who you're trying to design a dataset for, or you have, you know, the end customer of the business, where you can think about, okay, how am I designing my data management system for serving that? Whereas in research data, we have these kind of more broad ideas of, hey, I wanna put this dataset online, and hopefully somebody will reuse it. Right? And there's a big conversation about, hey, how much should you invest in actually creating all the metadata so that somebody could potentially, in the future, pick up your data and reuse it? And so one of the things I'm really excited about, coming back to our conversation about LLMs, is, can we use LLMs to help us create metadata on the fly in research so that new users can come in and actually understand your data and maybe use it for their domain? So an interesting example is, I was just talking to a researcher in construction materials.
Right? And she was talking about, okay, I have a dataset that is really focused on the material properties of this particular kind of concrete. Actually, that dataset might be useful for another scientist looking at life cycle optimizations of construction. Now those two scientists have different vocabularies. Can we use an LLM to bridge that gap for that scientist? Now those kinds of conversations, I think you have a little bit in business, but in research, it's doubly so. Right? The vocabularies are even farther apart. Right? The semantics are even farther apart. And so bridging those is, I think, one of the biggest challenges that we have in doing research data management. That brings up an interesting conversation I had. I forget exactly with whom it was,
[00:38:07] Tobias Macey:
but the overall gist of it was that the introduction of LLMs as a utility in the process of data management almost necessitates the inclusion of more external datasets within a business context, because you have the ability to do that. And by virtue of using external third party datasets, you are able to enrich the decision making that you're doing because you have a much broader context, a lot more information than just whatever first party data you're able to generate or collect. And I think that's an interesting parallel of, within research, being able to bridge across research domains because of the variance in the vocabularies used. And I think that also is applicable in that external dataset use case, where whoever created or curated that dataset, or whoever is selling that dataset, has a particular set of vocabularies.
They're probably targeting a particular set of industries, and so maybe that constrains who their focused target audience might be. But because LLMs can do some of that semantic mapping for you to translate into your specific business domain and vocabularies, it broadens the set of external assets that you might be able to incorporate and actually derive value from.
[00:39:25] Paul Groth:
Exactly. And I think one of the interesting knock on effects of exactly that is, a little bit coming back to the idea of large language models as a database. Right? So I think then you start to think about, oh, do we need to actually manage our large language models, and the access to large language models, as actually a data asset? Right? So are we just using it for its capabilities, or are we actually using the information in the LLM? And if I'm doing that, then maybe I need to start doing things like data versioning, having the right metadata about it, knowing if we have the right licenses, figuring out the data lineage that actually goes, not that we just use this component, but that we're actually sourcing information from that component. And I think this is something that, if you're in this space, you're gonna have to think about as a data management practitioner.
[00:40:18] Tobias Macey:
Yeah. One of the interesting reductive summaries of large language models that I've come across is the idea that they are effectively a very sophisticated lossy compression algorithm. And so in your experience of working in this space of academic research on data management, with this focus on machine learning use cases and the growing bidirectionality of that, what are some of the most interesting or innovative or unexpected ways that you have either seen LLMs applied in that context of data management, as either a receiving end or a producer, or just some of the interesting areas of research that either you or some of your colleagues are focused on that you think are worth highlighting for the audience?
[00:41:06] Paul Groth:
I think, for example, having point cloud databases. Right? Actually, I was just recently at SIGMOD. They had, essentially, you now can use point clouds and do, like, full structured queries across point clouds. Right? And also using these kinds of techniques, LLM techniques. Or I saw one streaming database where they're taking data coming in from a health care situation, where you have sensor data coming in, and they had health care records of the person, and they were streaming that live and then using LLMs as operators to work over that dataset. So it really comes back to this kind of, I've embedded the LLMs into the operators of the database. I think that's been, like, very cool for me.
[00:41:56] Tobias Macey:
And in your experience of working in this space and doing this academic research on data management and its intersection with these ML use cases, what are some of the most interesting or unexpected or challenging lessons
[00:42:12] Paul Groth:
that you've learned personally? Yeah. So I think the number one thing that I think I learned in industry, but I keep learning every day, is that real data is always surprising. Right? The particular properties of real workloads, figuring that out. So I'll give you a couple examples. Right? So just recently, one of my PhD students, David Jackson, and doctor Hazar Hamuch, we built a knowledge graph, a database on bioactive compounds from the literature. So it's called BaselDB. And it was just very interesting to see, like, how amorphous the data is around what constitutes bioactivity, and how we can, you know, take data from publications and kind of translate it into a core database.
We've worked with another PhD student of mine, Trissel Libertore. She actually built what we call FashionDB, which is a knowledge graph of fashion data. Right? So actually using LLMs to extract things like the context of fashion, how it was used, the different seasons, connecting that together, because we're using that for innovation studies in fashion. But just that real world data, every time we look at real data, it's always a mess, number one. And it's always super interesting, because people just don't do what you tell them to do in Database 101. So I think that's always surprising. It's also one of the reasons, like, I really like working with companies and real organizations and building real datasets ourselves: you get that experience of what people are actually doing when they create datasets. Right? So the Excel spreadsheet problem just persists, and it persists at every scale. Right? So that's always super challenging, but super fun.
[00:44:08] Tobias Macey:
And as you continue to invest in researching this constantly shifting ecosystem, what are some of the areas of focus that you are interested in for the near to medium term?
[00:44:21] Paul Groth:
Right. So one of the big things is, we were talking a little bit about large language models as databases. How can we constrain the information coming out of those, and how can we make sure that it's factual, to more or less a degree? We have a recent publication that will come out at a conference on neurosymbolic AI later this year that's exactly about this. Right? So how can we make sure that the facts that we get out of large language models are facts that we can use, or at least that we can give some evidence for or against? So I'm excited about those.
We talked in the very beginning about this idea of flexible data integration. So the idea of, hey, I have a completely new data model. What does your underlying data estate look like? Can I automatically populate that new data model? And can we really drive down the cost of, yeah, I have a new view of the world, I have a new set of semantics, I have a new data model, can we auto populate that? And can we have confidence in that auto population? So automated generation of mappings, automated information extraction. That I'm super excited about. And lastly, coming back to the idea that, you know, people are surprising and real world data is surprising, I think our data engineering pipelines have always had some human in the loop. We've called them annotators, or we've called them the crowd, or we've called them experts back in the day. But I think this idea of data engineering pipelines as the combination of humans, AI, and technical systems together only becomes more important as we kind of move up the stack with our data engineering pipelines. So those are some of the areas that, in the kind of medium term, we've been looking at and are pretty excited about.
[00:46:20] Tobias Macey:
Are there any other aspects of your areas of research focus or the overall impact of LLMs on the data engineering ecosystem or any of the other myriad topics that we touched on that we didn't discuss yet that you'd like to cover before we close out the show?
[00:46:36] Paul Groth:
No. I think that was a really great conversation. I really enjoyed it.
[00:46:41] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:46:56] Paul Groth:
I think the biggest gap right now is the fact that it's difficult to choose one technology stack. I mean, it's not tooling, but, we didn't talk about it, as a side project I have a start up company called Longform.ai. And one of the interesting things about that has been how do we choose the right technology stack. And I think this is really, really challenging. I think there's a lot of taste. It keeps moving. This is one of the reasons why I enjoy your podcast so much, Tobias, is that, you know, it keeps me up to date on the changing nature of data management. But helping people really choose a technology stack that's right for them is exceedingly difficult. We talked about that with respect to knowledge graphs, but you see it in everything from workflows to which cloud service I should use. I think having some way to do that, maybe we'll never get there, but I think this is the biggest challenge. Which technology stack should I use for my problem?
[00:48:01] Tobias Macey:
Yeah. It's definitely a constantly moving target. And for a certain period in the late nineties, early two thousands, you had very vertically integrated providers where it was, oh, well, you just go and use Informatica or pick your provider. And then in the late twenty tens into the beginning of twenty twenties, we had the growth of the, quote, unquote, modern data stack of, oh, just pick whatever you want, and then you just cobble them all together. It'll be fine. And then everybody realized, actually, that's a huge amount of work and really painful to deal with. And so I think now we're starting to move back into another area of consolidation where people are composing their own opinionated, vertically integrated stacks out of a grab bag of different technologies to say, we know that this is really painful. We're just going to do this part for you. Just buy our product, and it'll be amazing until you start to hit against the boundaries of that.
[00:49:02] Paul Groth:
So I have one last thing if we have a little time. Yeah. And I don't know where you put this, but I have a request for your listeners. So we teach a lot of bachelor students in data management. If your listeners could tell us one thing that we should teach our students coming out of a bachelor's computer science curriculum, that is something I would love to know; if they send me an email, that would be great. This is always interesting, and I'd love to hear from any of your listeners who have thoughts or opinions on that. Absolutely. And, yeah, so for anybody who does have opinions that they want to
[00:49:37] Tobias Macey:
send them along to Paul; his contact information is in the show notes. So with that, I thank you for taking the time today to join me and share your thoughts and experiences on the areas of research that you're focused on, as well as pontificating on the overall ecosystem. It's been a very enjoyable conversation. I appreciate the time and energy that you and your group are putting into helping to gain more insight into this constantly shifting space. So thank you again, and I hope you have a good rest of your day. Yeah. Thanks a lot, Tobias. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy? DataFold's migration agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure a perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to DBT, or handling complex multisystem migrations, they deliver production ready code with a guaranteed time line and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafull to book a demo and see how they turn months long migration nightmares into week long success stories. Your host is Tobias Macy, and today I'm interviewing Paul Groth about his research on knowledge graphs and data engineering. So, Paul, can you start by introducing yourself?
[00:01:06] Paul Groth:
Yeah. Thanks for having me. So I'm a professor at the University of Amsterdam where I lead a research group on intelligent data engineering. So this is really the intersection of how we use AI systems for data engineering and the other way around how we build better data management systems for AI.
[00:01:27] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:31] Paul Groth:
Yeah. So it's kind of interesting. I, I I when I was doing my bachelor's degree, I worked at a AI institute of all things. And then afterwards, I started my PhD, and I started my PhD in distributed computing. And I was working with use cases around high performance computing, and in particular, their provenance or data provenance of the results of high performance computing systems. And in particular, at the time, there was a thing called the grid, which is like a precursor to what we call the cloud now. And then the questions was how you do you track data provenance across these high performance computing systems. And so I started doing things like developing data models for data provenance.
And then, as the career went along, I started getting more and more into data systems. So, when I first moved to the The Netherlands, I started building the first kind of graph databases from different things. So building a large biomedical knowledge graph, what we call biomedical knowledge graph now, called OpenFAX, which was integrating I think we had, like, 20 different databases that we were trying to integrate. So kind of my journey was, okay. We need to do data provenance. Then I got fascinated by, hey. We're integrating these data system data from multiple sources and then started building these kind of big data integration systems.
[00:03:06] Tobias Macey:
And provenance is an interesting one to dig into because I think I first came across that specific terminology in the context of data on the data science side of things, and that was, I wanna say, somewhere around the time frame of 2014 or 2015. And since then, the terminology, at least within the ecosystem of data engineering in the organizational context, has been subsumed by the term lineage. And I'm wondering if you could maybe give your interpretation of some of the nuances between those two terms and where one maybe isn't quite a complete superset of the other.
[00:03:48] Paul Groth:
Yeah. I actually think when you're talking about data provenance and data lineage, you're pretty much talking about the same thing. Right? Although, technically, there's been an ongoing discussion, for example in research: are we talking about just things that are happening within your database? Right? So how do rows get transformed, and then we build views on top of that. And in the industry setting, and also in the broader context of data provenance, can we track across multiple systems? Right? So a lot of the original work in data provenance and data lineage was, on one side, really at the core database side of the world, tracking within the database, but then there were a lot of people looking at workflow systems and being able to track those workflow systems and how experimental results were produced.
And I think what you see now is a lot of focus on more of what I would call the workflow system style of provenance, so being able to trace across your organization. Right? So I think we've broadened out that term, and what we're really interested in is: hey, I got this result. How can I trace back across my whole data estate to where the input results are? We might use a data catalog for that. We might do some sort of structured logging for that. But I think that's really going beyond just tracking within your relational database, for example.
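As a rough illustration of the kind of structured logging Paul mentions, here is a minimal sketch of a provenance record emitted per transformation step. The field names, dataset identifiers, and system names are illustrative assumptions, not a specific standard such as W3C PROV or OpenLineage:

```python
import json
import uuid
from datetime import datetime, timezone

def provenance_record(inputs, outputs, activity, system):
    """Build a simple provenance/lineage event linking outputs to inputs.

    The fields here are illustrative; real deployments would typically follow
    a standard such as W3C PROV or OpenLineage and ship events to a catalog.
    """
    return {
        "event_id": str(uuid.uuid4()),
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "activity": activity,    # e.g. a scheduled job or notebook run
        "system": system,        # which system in the data estate did the work
        "inputs": inputs,        # upstream datasets (tables, files, APIs)
        "outputs": outputs,      # downstream datasets produced
    }

# Emit one event per transformation step; tracing a result back across the
# data estate then becomes a walk over these records.
event = provenance_record(
    inputs=["warehouse.raw.orders", "warehouse.raw.customers"],
    outputs=["warehouse.marts.revenue_by_region"],
    activity="daily revenue aggregation",
    system="orchestrator",
)
print(json.dumps(event, indent=2))
```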
[00:05:18] Tobias Macey:
And in terms of your academic focus, you mentioned that you're working with the intelligent data engineering lab. Obviously, a lot of the evolution of data engineering happens rather organically and, at least from my understanding and experience, has been very industry driven. And so I'm curious if you can give some of the details in terms of the areas of focus, some of the ways that data engineering can be conceived of as an academic pursuit, and some of the interesting insights that you're able to gain by virtue of your research that are maybe translatable into the day to day of engineers working in that organizational context.
[00:05:58] Paul Groth:
Yeah. So, actually, it's a pretty funny story about how I started INDELab. Well, not a funny story, but a story. I was an academic, I worked at the Free University of Amsterdam, and then I actually decided to go work at Elsevier, which is a large publisher and information and content company, and I was working in their research lab. At Elsevier, I had a really fantastic time, and it gave me really good insight into the practical day to day of managing large scale datasets. What are the problems that you see: information silos, the problems of semantics, getting to an agreement in an organization?
And that's one of the reasons, actually: after a couple of years there, I just kept seeing these kinds of fundamental problems that we continually have. Could we take a step back, go back to research, and think about some of those fundamental problems over the long horizon? In particular for me, these ideas of: how do we do data integration better? Do we have some primary mechanisms to work with messy data? Can I tell you rigorously, here's how you work with messy data, here's your data integration setup? Can I have a solution that I can say will work all the time? Can I have some rules about how to do that? So this is one of the reasons I went back into academia: that motivating factor of all these problems that we kept seeing within the corporate context.
I think, in general, there's a bigger conversation between academia and industry in the field of data engineering, or data management. Right? I mean, if you're working with a SQL database, that's coming from a lot of research, going back to the papers from Codd at IBM. But you also see this conversation between academics and industry happening a lot at places like SIGMOD. So I think sometimes you see the motivations coming from industry problems and then academics taking a step back and having a way to formalize those or come up with efficient ways to do them. I think maybe you've had this on the show, for example: the emergence of columnar databases would be an example of that kind of thing.
Specific to INDELab, what we're trying to do is work on a number of different thematic areas. One is: how do we automatically build databases, or automatically build knowledge graphs, from multiple sources? Right? So can I come to you and say, okay, given heterogeneous data, can I construct you a super high quality dataset that's super useful? Another area we are working on is what we call context aware data systems. So this is things like: how does your data management system cope with different users, different environments, changes in the dynamics around it? So you can think of data systems that are designed for digital twins. And lastly, I'm very compelled by data scientists and data workers.
And so how can we design data management systems for helping with machine learning? So, for example, how can we do deeper data quality assessments? Right? If you're building some data unit tests, can we tell you how that would work? Or, a recent paper we had was looking at the impact of data handling on machine learning models. So how do you change your datasets when you're doing data prep, and how does that impact your machine learning models? And it's kind of surprising: the impact is more than you would think it would be, and sometimes not in the way you would expect. Right? So your machine learning models can be very impacted by your data handling. Those are some of the areas we're working on, and I think they have messages for people in practice.
[00:10:13] Tobias Macey:
There are a lot of interesting things there to unpack, in particular that idea of context aware data systems. One of the things that immediately came to mind is the challenge of managing the nuance of access control in data warehouse environments, etcetera, where a lot of the suggested best practice is to use things like attribute based access controls. But then the challenge is: okay, well, how do you gain access to the appropriate attributes? How do you determine that mapping? And how does that propagate across a constantly shifting set of data models, where in the warehouse context you obviously want to make sure that you have consistent naming and consistent tagging, but there's typically not a broad guarantee that those are all in place in the way that you want them to be? And so there are numerous points at which you can accidentally leak access because you don't have the right attributes or because you haven't explicitly added the appropriate row level or column level permissions. And I'm wondering what your thoughts are, and whether there have been any areas of your research that touch on some of those complexities of identity and access control and managing the contextual elements of who should be able to do what and for what reasons.
[00:11:28] Paul Groth:
Yeah. So we haven't worked a lot on access control per se, but what we do see is exactly what you describe: this divergence in, I would call it, semantics. Right? So what do we call different attributes in different spaces? You see this all the time with things like the idea of customer versus person, or customer versus user. Is your user your customer? What's the role there? Are you a sysadmin, or are you an administrator? Are you a steward? What are all those called? And, essentially, the proliferation of data models is what it is. We have these classic cycles where everybody has their own data model, and we go to the data lake view of the world, and then we have different conversations about, oh, we need to data warehouse it so that we all agree on things, but then nobody ever fully does. So I think where the future is going is much more toward being able to adapt to any underlying data model for the application.
So, if we're talking about the security case, I can tell you that actually what you mean in this context is this person with this access control review. Right? And I think a lot of where we're gonna go in data management systems is this adaptability, and that's maybe something we'll talk about later: the use of AI to help you do that adaptation, because it is very hard, even in an organizational context, to enforce data models. I don't know if you've had that experience in your environment or seen it in other conversations you've had. But data model enforcement, especially across different environments, is hard. Right?
[00:13:11] Tobias Macey:
Yeah. One of the common refrains in that regard is the conversation around master data management, or ensuring that you do some of the entity disambiguation to make sure that the person you're talking about in this table is the same as the person you're talking about in that table, or the widget, or what have you. And so a lot of times that brings up the conversation around, oh, well, just turn it into a knowledge graph, with emphasis on "just" being a gross oversimplification of the effort involved.
And so, yeah, the adherence to specific domain models and specific semantics around those models is definitely, I think, one of the most long running and fraught challenges in data engineering as an exercise.
[00:14:00] Paul Groth:
Exactly. And one of the things I talk about when we talk about knowledge graphs is I always say these are great quality datasets. I don't know if that's actually true. But one of the things you like about them is, oh, everything has a unique ID, and we know exactly what we point to: the Data Engineering Podcast has a URL, it's uniquely identified, and we can describe it as a podcast. But if you actually look at any knowledge graph, you go out and look at Wikidata, there's still a lot of undisambiguated stuff. And even in knowledge graphs that are really rigorously maintained, you always have this proliferation of identity.
And in data management, we're just constantly trying to figure that out. So I think in some sense you have to have this conversation in your organization about what we are actually going to govern. Right? What are we gonna enforce, and what are we gonna let be willy nilly? And I think, in general, the question is: can we design some newer techniques that make it easier to do those kinds of mappings?
[00:15:04] Tobias Macey:
Digging a bit more into knowledge graphs from looking at some of your publications and some of your profile information available on the web, it shows that you've spent a significant portion of your career invested in that particular niche area. It's a fairly large niche, but in the broad context, it is something that has typically been constrained to a group of people who are very enthusiastic about it. And then there is a long tail of people who are interested but don't take the time to actually invest in it. And in the past couple of years in particular, there has been a huge growth and interest in knowledge graphs within the context of large language models, both for purposes of using the graph as a means of grounding and refining the context for the models, as well as the ability of those models to be used for actually constructing the graphs out of messy, unstructured data as a means of very rapidly bootstrapping a knowledge graph effort.
But then there's also still the long tail of cleaning and curating and pruning that graph. And so I'm wondering if you can talk to some of the ways that you're seeing the overall industry adoption maybe retread ground that is already well known, certain dead ends, etcetera, or maybe some of the ways that this renewed interest is reinvigorating that overall ecosystem of knowledge graphs as an area of both academic and industry pursuit.
[00:16:29] Paul Groth:
Yeah. So I think it's been pretty exciting, the introduction of LLMs just for the construction of knowledge graphs. Right? One of the biggest problems in building knowledge graphs is that what you were usually doing, say ten years ago, is taking relational databases and converting them into graphs. You were writing some sort of mapping language into some sort of graph structure, whether it be RDF or Cypher or whatever modeling language you wanted, but you were writing mapping rules. Or a big area of interest was using natural language processing to do information extraction, named entity resolution, and relation extraction, those kinds of things, but you were building pretty complicated pipelines to get that done. And what we've seen is that it's just so much easier now to construct those information extraction pipelines.
It's actually even easier to construct mappings. Right? So if you have underlying relational databases, it's pretty easy to get models to create mappings for you to a graph. And I think that's super exciting, because it makes that process of building a graph a lot easier, which is where I think a lot of people got hung up. Right? Because it was a big ask. When I was at Elsevier, we built knowledge graphs; I was one of the people building some of those first knowledge graphs, and we still have projects going on with them, and you see it's much, much easier for them to now construct those models and do the integration. And what that means, I think, is that we can start talking about what some of the benefits are from creating a graph. And I don't necessarily think it's having a graph database, necessarily.
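To make the "let the model draft the mapping" idea concrete, here is a minimal sketch under stated assumptions: call_llm is a placeholder for whatever model API you use, and the table schema, prompt, and rule format are invented for illustration rather than taken from the episode:

```python
# Sketch of LLM-assisted mapping generation from a relational schema to a graph.
# `call_llm` is a stand-in for a real model call; the canned return value keeps
# the example self-contained. A human still reviews the draft before it runs.

def call_llm(prompt: str) -> str:
    # In practice: send the prompt to your model provider and return its text.
    return (
        "(Customer {customers.customer_id}) -[PLACED]-> (Order {orders.order_id}), "
        "from orders.customer_id\n"
        "(Order {orders.order_id}) -[CONTAINS]-> (Product {orders.product_id}), "
        "from orders.product_id"
    )

TABLE_SCHEMA = """
orders(order_id, customer_id, product_id, ordered_at)
customers(customer_id, name, country)
"""

prompt = (
    "Given these relational tables:\n"
    f"{TABLE_SCHEMA}\n"
    "Propose a property-graph mapping, one rule per line, in the form:\n"
    "(subject_node) -[RELATION]-> (object_node), from <table.column>"
)

draft_mapping = call_llm(prompt)
for rule in draft_mapping.splitlines():
    print(rule)  # review and edit these rules before materializing any edges
```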
Although you just had a great talk with the folks from Kuzu DB, which I really enjoyed. But I think the recognition is that when you're building that graph, you start talking about defining those semantics that we were just talking about. Right? What do we mean by person? What do we mean by podcast? What do we mean by episode? What do we mean by customer? And writing those down and making those explicit. And, you know, I don't know if you know Juan Sequeda from data.world, now part of ServiceNow.
We've written things together, and he, in particular, has been on this journey of emphasizing the role that you wanna play in getting that agreement. Right? So that focus has been very strong. Another really interesting change with LLMs is that I think there was this idea maybe a couple of years ago that if I was gonna do a knowledge graph, everything had to be in the knowledge graph. Right? I had to convert all my data into a knowledge graph. That was never the case, but it was the central dogma that you kind of heard. And now what we've seen is this idea that, hey, you can put part of your data in a graph where it makes sense, and we can connect out. We can connect out with links to the underlying datasets. We can connect out with queries.
We can just leave data as literals, so actual blobs in the graph itself, and we just leave the data there. And that's actually okay, because we can take advantage of the fact that we have LLMs that are able to extract meaning from those literals without having to structure everything. And I think this ease of construction also points at things like: hey, actually, we can build a graph on the fly if that's useful for my downstream task. This is where you've seen things like Microsoft's GraphRAG, where what they do is construct a graph because it's helpful when we go out to prompt a language model. So I think those are some of the places where we see, okay, maybe we're in the trough, or wherever we are on those hype curves people talk about, but we're in a place where people are saying: this is where a knowledge graph is useful in my whole data engineering pipeline, in my whole data management system. It's not an all or nothing proposition.
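A small sketch of that "graph on the fly" pattern, in the spirit of (but not reproducing) systems like GraphRAG; the triples below are hard-coded stand-ins for what an LLM extraction step would produce:

```python
# Build a small graph from extracted triples, then serialize an entity's
# neighborhood as plain text to ground a prompt. Triples are illustrative.
import networkx as nx

triples = [
    ("Data Engineering Podcast", "has_episode", "Knowledge Graphs with Paul Groth"),
    ("Paul Groth", "works_at", "University of Amsterdam"),
    ("Paul Groth", "leads", "INDELab"),
]

G = nx.DiGraph()
for subj, rel, obj in triples:
    G.add_edge(subj, obj, relation=rel)

def neighborhood_context(entity: str) -> str:
    """Serialize an entity's outgoing edges as text to drop into a prompt."""
    lines = [
        f"{entity} {data['relation']} {obj}"
        for _, obj, data in G.out_edges(entity, data=True)
    ]
    return "\n".join(lines)

print(neighborhood_context("Paul Groth"))
```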
[00:20:51] Tobias Macey:
I think one of the elements of the knowledge graph ecosystem that maybe hampered its growth is the lack of unification around how to properly model and represent those graphs and then query them. There was the evolution from, I think, the earliest one being maybe OWL, and then SPARQL gained a lot of ground in the semantic web era, which a lot of people are still very interested in and devoted to. And then Neo4j came out and popularized the property graph model, which simplified the overall means of constructing and interacting with the graphs, but there was still no standardized means of actually querying them. So you had Cypher from Neo4j. You had Gremlin as sort of the open approach. Now we have GQL as the standard track syntax that is largely modeled after Cypher, but then you also have things such as Dgraph that leaned in on GraphQL as a query interface. And I'm wondering what your thoughts are on some of the ways that that lack of unification and lack of, I guess, settled best practice within that ecosystem has maybe hamstrung its adoption and growth, at least up until now.
[00:22:01] Paul Groth:
Yeah. So I think one of the banes of my existence is this conflation of, or the connection between, the principles of a knowledge graph and the underlying technology stacks you might need. I think you can build a knowledge graph in a relational database, no problem. Right? There are some principles there: okay, we're gonna have unique identifiers; we're gonna have relations, connections, links between them; we're gonna have types. Those are useful concepts to have, and we do that all the time. If you stand up at a whiteboard, you draw nodes and edges and you draw attributes. It's a useful thing to do. And now we can use different technology stacks to implement those. This is something that I always try to convey to people; maybe I need to do a better job or do more outreach. What I have seen, though, is that I'm very excited about things like Amazon Neptune, where, hey, if you wanna query in Cypher, or you wanna have a property graph view of the world, or you wanna have more of an RDF or semantic web or triple view of the world, you can have that over the same graph data that's there. Right? So there's almost an independence there. And that's where things like GQL are interesting, where you have your relational database, and then, hey, if you wanna treat that as a graph, here's how you do it: here are the nodes and here are the edges, technically, in that language.
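A minimal sketch of what that can look like in practice, using SQLite purely for illustration: stable identifiers, explicit types, and named relations, with traversal as an ordinary join. The table and identifier conventions here are assumptions, not a prescribed schema:

```python
# Knowledge-graph principles inside a plain relational database:
# unique node IDs, explicit types, and named edges between nodes.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE nodes (
        node_id TEXT PRIMARY KEY,   -- stable, unique identifier
        node_type TEXT NOT NULL,    -- e.g. Person, Podcast, Episode
        label TEXT
    );
    CREATE TABLE edges (
        src TEXT NOT NULL REFERENCES nodes(node_id),
        relation TEXT NOT NULL,     -- explicit, named relationship
        dst TEXT NOT NULL REFERENCES nodes(node_id)
    );
""")
con.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    ("person:paul-groth", "Person", "Paul Groth"),
    ("podcast:data-engineering", "Podcast", "Data Engineering Podcast"),
])
con.execute("INSERT INTO edges VALUES (?, ?, ?)",
            ("podcast:data-engineering", "features", "person:paul-groth"))

# Graph-style traversal is just a join over the edges table.
for row in con.execute("""
    SELECT n1.label, e.relation, n2.label
    FROM edges e
    JOIN nodes n1 ON n1.node_id = e.src
    JOIN nodes n2 ON n2.node_id = e.dst
"""):
    print(row)
```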
And so for me, that's the exciting point, because I think people are beginning to understand the underlying importance of the concepts, and maybe that's the result of all this proliferation of different technology stacks. That's okay. And, okay, now we're debating what's your favorite syntax, what's your fastest database, what is easiest to install, what can I get on my cloud? For me, maybe that's a good sign for the future, right, where you see vendors that are very much focused on developer experience looking to use these concepts, but maybe in a way that's better built into what you wanna use as a developer.
[00:24:15] Tobias Macey:
Moving back over to LLMs and the impact that they're having on the practice of interacting with data and the purpose of data in a lot of cases, I'm curious what you're seeing in your areas of research or some of the particularly interesting or novel applications of LLMs to that data engineering context or ways that you're seeing it shift both the academic and industrial practices around data.
[00:24:42] Paul Groth:
Alright. So I think there's a lot. One place, I think, is this idea of multimodal data becoming actually integrated into your database. Right? So, essentially, having database operators that are LLMs. And here I'd point to something like Immanuel Trummer's work on SwellDB, or there's an extension to DuckDB called FlockMTL. What these do is, essentially, in your database query, your SQL query, you can just write what looks like a user defined function that's calling out to the LLM. And what's cool about that is you can have essentially the same kind of declarative SQL style queries, but now you can do that over unstructured data. So multimodal data really becoming a first class citizen: text, images, directly in your database. And I think that might help us a lot, because oftentimes we've treated multimodal data, images, or videos as completely separate: you put that in your S3 bucket, you have a link out to it, you load it in. Now you can just store that in your database and do operations on it without necessarily pushing it to a separate pipeline. And I think that's a pretty interesting, exciting thing.
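This is not the actual SwellDB or FlockMTL API, just a hedged sketch of the general shape, assuming a recent DuckDB with Python UDF support: register a scalar function (here a stub standing in for an LLM call) and use it inside a declarative query over text:

```python
# Shape of the "LLM as a database operator" idea: a Python UDF registered in
# SQL. The `summarize` function is a stub; in practice it would call a model.
import duckdb

def summarize(text: str) -> str:
    # Stub: in practice this sends `text` to an LLM and returns its answer.
    return text[:40] + "..."

con = duckdb.connect()
con.create_function("summarize", summarize)  # types inferred from annotations

con.execute("""
    CREATE TABLE support_tickets AS
    SELECT * FROM (VALUES
        (1, 'Customer cannot log in after the password reset email expired.'),
        (2, 'Invoice totals differ between the dashboard and the CSV export.')
    ) AS t(ticket_id, body)
""")

print(con.execute(
    "SELECT ticket_id, summarize(body) AS gist FROM support_tickets"
).fetchall())
```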
Another thing, a bit more out there, is this idea of large language models as databases themselves. Right? So there's work by one of my colleagues in INDELab, Jan-Christoph Kalo, by Paolo Papotti at EURECOM, and also, again, by Immanuel Trummer, where, essentially, you treat the language model as a database. Everybody's familiar with this: language models know things. They have facts and data about the world. Do we necessarily need to build an actual database? Right? Can we just get our data directly from the language model? That actually solves some of the problems of data ingestion, because, essentially, your pretraining is almost your data ingestion, if you think about it that way. Now, there are lots of interesting ramifications of that. Right? Can you trust the facts coming back from your LLM?
Do you wanna treat it as that? Right? What's the trade off for keeping things in your LLM parameters versus actually using a traditional database, or even looking at unstructured content? So I think that's a very exciting research angle. And lastly, I don't think it's necessarily to do with LLMs per se, but it's the ramifications of training LLMs and the kinds of architectures that we're building. Essentially, all the major clouds, everybody, is investing massive amounts of resources in building data centers designed for ML systems, designed for training large language models.
And that means the underlying system architectures that we have are changing. And so the data management systems that we have are changing to take advantage of those underlying changes in computer architecture. Here I would point to Matteo Interlandi. He's at Microsoft Research, and he's done work on essentially building a database on top of GPUs by using PyTorch. So you have your SQL operators, your joins, your projections, your selections, and what they are is PyTorch functions that you then compile down, and you get these massive boosts in performance.
Why? Because you have all these GPUs. Right? People are buying loads of GPUs to put in these data centers. So I think this is something pretty exciting: hey, maybe we change the underlying architecture of our database to take advantage of all of the investment around AI hardware.
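Purely as an illustration of that idea (not Interlandi's actual implementation), here is a tiny sketch where a selection is a boolean mask and a grouped aggregation is a scatter-add, both of which run unchanged on a GPU when one is available:

```python
# Relational operators expressed as tensor operations.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# A tiny "sales" table stored as columns: region id and amount.
region = torch.tensor([0, 1, 0, 2, 1, 0], device=device)
amount = torch.tensor([10.0, 5.0, 7.0, 3.0, 8.0, 2.0], device=device)

# SELECT ... WHERE amount > 4   -> boolean mask
mask = amount > 4
region_f, amount_f = region[mask], amount[mask]

# SELECT region, SUM(amount) ... GROUP BY region   -> scatter-add
num_regions = 3
sums = torch.zeros(num_regions, device=device).scatter_add_(0, region_f, amount_f)
print(sums.tolist())  # per-region totals
```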
[00:28:37] Tobias Macey:
Yeah. That's definitely an interesting aspect as well, where we have had GPU powered databases, again as sort of a niche effort for specific use cases, but they've typically been very expensive because they need a GPU, which is not cheap to operate. And another area of computer architecture evolution is the idea of neuromorphic computing, where you're trying to change from just being a straight Von Neumann architecture to using interconnects and architectures that are more akin to the way that the human brain operates, as far as the highly parallelized, highly connected means of data interchange. Some of the, I guess, more pedestrian ways of thinking about it are some of the work that AMD is doing, where they're co-locating their CPU on the same chip as the GPU to allow for higher parallelism and data interchange between them. There's a lot of work going into the network stack to be able to do direct networking from the CPU to the GPU to cut out a lot of the bus transfer, etcetera. And I'm wondering, what are some of the conversations that you've had around how that maybe changes the ways that we even think about the role of a database in that context and the shape of what it can be or should be?
[00:29:51] Paul Groth:
Well, I think what it means is the conversations are: okay, we can put everything in memory, or we can drive a lot of things into memory. Right? I mean, if you're working with a MacBook, we all have massive amounts of memory that's available both to your GPU and to your CPU, and you can actually load your databases completely into memory, for the most part. So, in fact, this was the thinking behind the folks who live across the street at CWI in the creation of DuckDB: look, you can have columnar databases that look like SQLite, but because of the changes in computer architecture, we just have so much more RAM that we can actually deal with. And, as you said, there's this idea of, hey, we have so much memory bandwidth.
We can use these different kinds of chips, or different cores on your processor, to do different data management functions. And I think what that means, if you're a data management practitioner, is maybe you don't have to be in a cloud database right away. Right? So maybe that's the kind of message. Or maybe we can have pipelines where we can just auto construct a database right away, stand it up really quickly, do some things to it, shut it down, and maybe ship it across the network. So I think that's what I mean when I say we have to think about the architectures that are available, and how they change how we do this kind of data management practice.
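A small sketch of that "stand it up, use it, throw it away" pattern with an in-memory DuckDB instance querying a local file; the file name and contents are made up so the example is self-contained:

```python
# Ephemeral, local analytics: no server, no warehouse, nothing to tear down
# beyond the process's own memory.
import pathlib
import duckdb

pathlib.Path("orders_2024.csv").write_text(
    "product_id,quantity\nA,3\nB,7\nA,2\n"
)

con = duckdb.connect(":memory:")
result = con.execute("""
    SELECT product_id, SUM(quantity) AS units
    FROM read_csv_auto('orders_2024.csv')
    GROUP BY product_id
    ORDER BY units DESC
""").fetchall()
con.close()
print(result)  # e.g. [('B', 7), ('A', 5)]
```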
[00:31:31] Tobias Macey:
Absolutely. Yeah. It brings up one of the sort of tribal knowledge elements in the data industry, the idea of data gravity: oh, you just wanna ship your compute to your data because it's too expensive to move the data anywhere else. With the introduction of things like DuckDB and KuzuDB, these very lightweight, easy to construct and throw away data stores for doing high performance compute and querying, there has been a massive sea change in terms of how people think about their data architecture, where maybe they do still have a massive central repository of data, but not all of the computation happens across that repository.
And so there has been a massive growth in edge compute, or moving data to the edge for operation, because it reduces round trip latencies. That's, again, something that people have been doing for a while, at least in some small subset of cases, but it was generally a more specialized thing: oh, I need to optimize this specific thing, and so I'm doing this edge computing to deal with this one case. Now it's becoming a more generalized pattern because the tooling and capabilities have expanded to incorporate it. And the idea of data federation with these lakehouse architectures also means that the data doesn't even necessarily need to be situated in one location if you can massively parallelize the access of that underlying data, doing more of the push down of the query pruning to say, oh, I actually only need to select this subset of data; I don't need to scan absolutely every file for everything. A lot of those optimizations, I think, are definitely very interesting and are changing the ways that we manage our data architectures, especially as the scope of interaction with those datasets grows as well. Largely it was very centralized: leaving out the hyperscalers, you know, Facebook, Google, Amazon, etcetera, any given company would largely have a fairly geographically constrained audience, and so they didn't necessarily need to optimize for federation.
But as the availability of Internet access and of compute in multiple locations grows, and obviously with issues around data sovereignty, that changes the ways that we even need to be thinking about architecting our systems, for various reasons. And so I think that's another motivation for broadening the architectural principles of how we think about what data operations need to be done where and by whom.
[00:33:59] Paul Groth:
Exactly. Right. And, you know, I'm in Europe. So we think a lot about: okay, do you have data liability? Do I even want this dataset? Is it better to have it on the person's local device, while maybe I'm sourcing data from somewhere else? And, actually, given the power of edge devices and the ability for us to have things like Wasm, I can compile things for the browser, run them really fast, pull down necessary data from a central place, but leave the personal data with the person. So I think those are the kinds of architectures that we're gonna see more and more of, as you said. And I think it's kind of exciting, given the capacity of edge devices.
[00:34:40] Tobias Macey:
And then another aspect that we've touched on a little bit is this juxtaposition of data engineering in the context of an organization, where I'm focused on the needs of a particular business and the data assets that they need to be able to introspect their operations or provide services to their customers, and the data management approaches and requirements in research and academia, where you're largely focused on either creating or curating datasets, and a lot of the data management effort goes into how do I make sure that this data asset is reproducible for use by other people, either for derivative research or for being able to replicate my research. And I'm wondering if you can give some of your perspective on what you see as the various areas of overlap and disjoint practices across those two arenas.
[00:35:36] Paul Groth:
Yeah. I think the biggest thing is that we don't have any top downness in a research context. Right? In a business context, you may not think that you have top downness, but you have a little bit. We can come up with data models, maybe we have lots of them, but we can come up with data models that we agree on. Getting agreement on data models in research is very hard. People try to do it; there are standards bodies that try to do this in research, building biomedical ontologies, for example. But it's a much more difficult process, and a process that we often don't even do in research. So in some sense, one of the nice things in business is that you actually do have a customer. You either have an internal customer who you're trying to design a dataset for, or you have the end customer of the business, where you can think about: okay, how am I designing my data management system for serving that? Whereas in research data, we have these more broad ideas of: hey, I wanna put this dataset online, and hopefully somebody will reuse it. And there's a big conversation about how much you should invest in actually creating all the metadata so that somebody could potentially, in the future, pick up your data and reuse it. So one of the things I'm really excited about, coming back to our conversation about LLMs, is: can we use LLMs to help us create metadata on the fly in research, so that new users can come in, actually understand your data, and maybe use it for their domain? An interesting example: I was just talking to a researcher in construction materials.
And she was talking about: okay, I have a dataset that is really focused on the material properties of this particular kind of concrete. Actually, that dataset might be useful for another scientist looking at life cycle optimizations of construction. Now, those two scientists have different vocabularies. Can we use an LLM to bridge that gap for that scientist? Those kinds of conversations you have a little bit in business, but in research it's doubly so. The vocabularies are even farther apart. The semantics are even farther apart. And so bridging those is, I think, one of the biggest challenges that we have in doing research data management.
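As a sketch of that metadata-on-the-fly idea, using the concrete example above: ask a model to describe a dataset for a reader from a different field. The call_llm function, column names, and target vocabulary are illustrative assumptions, not from the episode:

```python
# Generate a cross-domain dataset description with an LLM (stubbed here).

def call_llm(prompt: str) -> str:
    # Stub: replace with a real model call. A canned answer keeps this runnable.
    return "Compressive-strength measurements for one concrete mix, usable as inputs to life-cycle assessment of construction materials."

columns = ["mix_id", "cement_kg_m3", "water_cement_ratio", "compressive_strength_mpa"]
sample_rows = [("M-014", 320, 0.45, 41.2), ("M-015", 300, 0.50, 37.8)]

prompt = (
    f"Columns: {columns}\n"
    f"Sample rows: {sample_rows}\n"
    "Write a short dataset description for a researcher in life-cycle "
    "assessment of construction, mapping these material-science terms to "
    "their vocabulary."
)
print(call_llm(prompt))
```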
[00:38:07] Tobias Macey:
That brings up an interesting conversation I had, I forget exactly with whom, but the overall gist of it was that the introduction of LLMs as a utility in the process of data management almost necessitates the inclusion of more external datasets within a business context, because you have the ability to do that. And by virtue of using external third party datasets, you are able to enrich the decision making that you're doing because you have a much broader context, a lot more information than just whatever first party data you're able to generate or collect. And I think there's an interesting parallel to, within research, being able to bridge across research domains because of the variance in the vocabularies used. And I think that is also applicable in that external dataset use case, where whoever created or curated that dataset, or whoever is selling that dataset, has a particular set of vocabularies.
They're probably targeting a particular set of industries, and so maybe that constrains who their focused target audience might be. But because LLMs can do some of that semantic mapping for you to translate into your specific business domain and vocabularies, it broadens the set of external assets that you might be able to incorporate and actually derive value from.
[00:39:25] Paul Groth:
Exactly. And I think one of the interesting knock on effects of exactly that comes back a little bit to the idea of large language models as a database. Right? So then you start to think: oh, do we need to actually manage our large language models, and our access to large language models, as actually a data asset? Are we just using it for its capabilities, or are we actually using the information in the LLM? And if I'm doing that, then maybe I need to start doing things like data versioning, having the right metadata about it, knowing if we have the right licenses, figuring out the data lineage that says not just that we used this component, but that we're actually sourcing information from that component. And I think this is something that, if you're in this space, you're gonna have to think about as a data management practitioner.
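One way to picture that is to catalogue the model the same way you would any other upstream data source. This is a minimal, illustrative record of the fields Paul mentions, not a standard schema:

```python
# Treat an LLM used as a source of facts like a governed data asset:
# pin the version, record the license, note what is known about its
# training data, and track which downstream datasets it feeds.
from dataclasses import dataclass, field, asdict

@dataclass
class ModelAssetRecord:
    model_id: str                 # e.g. provider name plus version pin
    license: str                  # usage terms for weights and outputs
    training_data_notes: str      # what is known about upstream sources
    used_as: str                  # "capability only" vs "source of facts"
    downstream_datasets: list[str] = field(default_factory=list)  # lineage

record = ModelAssetRecord(
    model_id="example-provider/llm-2025-01",       # hypothetical identifier
    license="commercial API terms, outputs owned by customer",
    training_data_notes="training corpus not disclosed by vendor",
    used_as="source of facts (entity descriptions)",
    downstream_datasets=["warehouse.enriched.company_profiles"],
)
print(asdict(record))
```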
[00:40:18] Tobias Macey:
Yeah. One of the interesting reductive summaries of large language models that I've come across is the idea that they are effectively a very sophisticated lossy compression algorithm. Yeah. And so, in your experience of working in this space of academic research on data management with this focus on machine learning use cases and the growing bidirectionality of that, what are some of the most interesting or innovative or unexpected ways that you have seen LLMs applied in that context of data management, as either a receiving end or a producer, or just some of the interesting areas of research that either you or some of your colleagues are focused on that you think are worth highlighting for the audience?
[00:41:06] Paul Groth:
I think, for example, having point cloud databases. Right? Actually, I was just recently at SIGMOD, and they had, essentially, the ability to use point clouds and do full structured queries across them, also using these kinds of LLM techniques. Or I saw one streaming database where they're taking data from a health care situation where you have sensor data coming in, plus the health care records of the person, and they were streaming that live and then using LLMs as operators to work over that dataset. So it really comes back to this idea of embedding LLMs into the operators of the database. I think that's been very cool for me.
[00:41:56] Tobias Macey:
And in your experience of working in this space and doing this academic research on data management and its intersection with these ML use cases, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:42:12] Paul Groth:
Yeah. So I think the number one thing, which I learned in industry but keep learning every day, is that real data is always surprising. Right? The particular nature of workloads, figuring that out. So I'll give you a couple of examples. Just recently, one of my PhD students, David Jackson, and doctor Hazar Hamuch, we built a knowledge graph, a database on bioactive compounds from the literature, called BaselDB. And it was just very interesting to see how ambiguous the data is around what constitutes bioactivity, and how we can take data from publications and translate it into a core database.
We've worked with another PhD student of mine, Trissel Libertore. She actually built what we call FashionDB, which is a knowledge graph of fashion data, actually using LLMs to extract things like the context of fashion, how it was used, the different seasons, and connecting that together, because we're using it for innovation studies in fashion. But just that real world data: every time we look at real data, it's always a mess, number one. And it's always super interesting, because people just don't do what you tell them to do in Database 101. So I think that's always surprising. It's also one of the reasons I really like working with companies and real organizations and building real datasets ourselves: you get that experience of what people are actually doing when they create datasets. The Excel spreadsheet problem just persists, and it persists at every scale. So that's always super challenging, but super fun.
[00:44:08] Tobias Macey:
And as you continue to invest in researching this constantly shifting ecosystem, what are some of the areas of focus that you are interested in for the near to medium term?
[00:44:21] Paul Groth:
Right. So one of the big things is what we were talking a little bit about: large language models as databases. How can we constrain the information coming out of those, and how can we make sure that it's factual, to more or less a degree? We have a publication that will come out at a conference on neurosymbolic AI later this year that's exactly about this: how can we make sure that the facts we get out of large language models are facts that we can use, or at least facts for which we can give some evidence for or against. So I'm excited about that.
We talked at the very beginning about this idea of flexible data integration. So the idea of: hey, I have a completely new data model. What does your underlying data estate look like? Can I automatically populate that new data model? Can we really drive down the cost of: yeah, I have a new view of the world, I have a new set of semantics, I have a new data model, can we auto populate that, and can we have confidence in that auto population? So automated generation of mappings, automated information extraction. That I'm super excited about. And lastly, coming back to the idea that people are surprising and real world data is surprising, I think our data engineering pipelines have always had some human in the loop. We've called them annotators, or we've called them the crowd, or we've called them experts back in the day. But this idea of data engineering pipelines as the combination of humans, AI, and technical systems together, I think, only becomes more important as we move up the stack with our data engineering pipelines. So those are some of the areas that, in the kind of medium term, we've been looking at and are pretty excited about.
[00:46:20] Tobias Macey:
Are there any other aspects of your areas of research focus or the overall impact of LLMs on the data engineering ecosystem or any of the other myriad topics that we touched on that we didn't discuss yet that you'd like to cover before we close out the show?
[00:46:36] Paul Groth:
No. I think that was a really great conversation. I really enjoyed it.
[00:46:41] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:46:56] Paul Groth:
I think the biggest gap right now is the fact that it's difficult to choose one technology stack. I mean, it's not tooling, but, something we didn't talk about: as a side project, I have a startup company called Longfork.ai. And one of the interesting things about that has been how we choose the right technology stack. I think this is really, really challenging. There's a lot of taste involved, and it keeps moving. This is one of the reasons why I enjoy your podcast so much, Tobias: it keeps me up to date on the changing nature of data management. But helping people really choose a technology stack that's right for them is exceedingly difficult. We talked about that with respect to knowledge graphs, but you see it in everything from workflows to which cloud service I should use. I think having some way to do that, maybe we'll never get there, but I think this is the biggest challenge: which technology stack should I use for my problem?
[00:48:01] Tobias Macey:
Yeah. It's definitely a constantly moving target. And for a certain period in the late nineties and early two thousands, you had very vertically integrated providers, where it was, oh, well, you just go and use Informatica or pick your provider. And then in the late twenty tens into the beginning of the twenty twenties, we had the growth of the, quote, unquote, modern data stack: oh, just pick whatever you want, and then you just cobble them all together; it'll be fine. And then everybody realized, actually, that's a huge amount of work and really painful to deal with. And so I think now we're starting to move back into another era of consolidation, where people are composing their own opinionated, vertically integrated stacks out of a grab bag of different technologies to say: we know that this is really painful; we're just going to do this part for you; just buy our product, and it'll be amazing, until you start to hit against the boundaries of it.
[00:49:02] Paul Groth:
So I have one last thing, if we have a little time. Yeah. And I don't know where you put this, but I have a request for your listeners. We teach a lot of bachelor students in data management. If they could tell us one thing, by sending me an email, that would be great: one thing that we should teach our students coming out of a bachelor's computer science curriculum. That I would love to know. This is always interesting, and I'd love to hear from any of your listeners who have thoughts or opinions on it. Absolutely. And, yeah, for anybody who does have opinions that they want to
[00:49:37] Tobias Macey:
send them along to Paul; his contact information is in the show notes. So, with that, I thank you for taking the time today to join me and share your thoughts and experiences on the areas of research that you're focused on, as well as pontificating on the overall ecosystem. It's been a very enjoyable conversation. I appreciate the time and energy that you and your group are putting into helping gain more insight into this constantly shifting space. So, thank you again, and I hope you have a good rest of your day. Yeah. Thanks a lot, Tobias. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Paul Groth and Knowledge Graphs
Understanding Data Provenance vs. Lineage
Academic Pursuits in Data Engineering
Context-Aware Data Systems
Challenges in Knowledge Graph Adoption
Standardization in Knowledge Graphs
Impact of LLMs on Data Engineering
Data Architecture Evolution and Edge Computing
LLMs as a Data Management Tool
Future Research Directions in Data Management