Summary
In this episode of the Data Engineering Podcast Alex Albu, tech lead for AI initiatives at Starburst, talks about integrating AI workloads with the lakehouse architecture. From his software engineering roots to leading data engineering efforts, Alex shares insights on enhancing Starburst's platform to support AI applications, including an AI agent for data exploration and using AI for metadata enrichment and workload optimization. He discusses the challenges of integrating AI with data systems, innovations like SQL functions for AI tasks and vector databases, and the limitations of traditional architectures in handling AI workloads. Alex also shares his vision for the future of Starburst, including support for new data formats and AI-driven data exploration tools.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what "done" looks like — so you can stop fighting over column names, and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda's launch week. It starts June 9th.
- This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.
- Your host is Tobias Macey and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining the interaction points of AI with the types of data workflows that you are supporting with Starburst?
- What are some of the limitations of warehouse and lakehouse systems when it comes to supporting AI systems?
- What are the points of friction for engineers who are trying to employ LLMs in the work of maintaining a lakehouse environment?
- Methods such as tool use (exemplified by MCP) are a means of bolting on AI models to systems like Trino. What are some of the ways that is insufficient or cumbersome?
- Can you describe the technical implementation of the AI-oriented features that you have incorporated into the Starburst platform?
- What are the foundational architectural modifications that you had to make to enable those capabilities?
- For the vector storage and indexing, what modifications did you have to make to Iceberg?
- What was your reasoning for not using a format like Lance?
- For teams who are using Starburst and your new AI features, what are some examples of the workflows that they can expect?
- What new capabilities are enabled by virtue of embedding AI features into the interface to the lakehouse?
- What are the most interesting, innovative, or unexpected ways that you have seen Starburst AI features used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI features for Starburst?
- When is Starburst/lakehouse the wrong choice for a given AI use case?
- What do you have planned for the future of AI on Starburst?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- Starburst
- AWS Athena
- MCP == Model Context Protocol
- LLM Tool Use
- Vector Embeddings
- RAG == Retrieval Augmented Generation
- Starburst Data Products
- Lance
- LanceDB
- Parquet
- ORC
- pgvector
- Starburst Icehouse
[00:00:11] Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Poor quality data keeps you from building best in class AI solutions. It costs you money and wastes precious engineering hours. There is a better way. Coresignal's multi-source, enriched, cleaned data will save you time and money. It covers millions of companies, employees, and job postings and can be accessed via API or as flat files. Over 700 companies work with Coresignal to develop AI solutions in investment, sales, recruitment, and other industries. Go to dataengineeringpodcast.com/coresignal and try Coresignal's self-service platform for free today.
This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of undiagnosed data quality syndrome, also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what "done" looks like, so you can stop fighting over column names and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you.
Side effects of implementing Soda may include increased trust in your metrics, reduced late night Slack emergencies, spontaneous high fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda's launch week, which starts on June 9. Your host is Tobias Macey, and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads. So, Alex, can you start by introducing yourself?
[00:02:11] Alex Albu:
My name is Alex Albu. I've been with Starburst for about six years now, and I'm currently the tech lead for our AI initiative.
[00:02:23] Tobias Macey:
And do you remember how you got started working in data?
[00:02:27] Alex Albu:
Yeah. I come from a software engineering background. But a few jobs ago, I was working for IDEXX on their veterinary practice software, and we had to build a few ETL pipelines pulling in data from various practices into IDEXX. I think the point where I really got into data engineering was when I was working on rebuilding an ETL system based on Hadoop. I replaced that with a Spark-based system, and the results were actually pretty spectacular. The performance gains let us go from running a five node cluster twenty-four seven to a smaller three node cluster that was just running a few hours a day. So that got me into big data.
When I moved on to my next job at a company called TraceLink, I built an analytics platform there, using Spark for ETL and Redshift for querying data. We started running into limitations of Redshift at that point and started looking at other solutions for analytics, and we came across Athena for querying data that we were dumping into a data lake. I thought this was a great solution until, again, we started using it for more serious use cases and I started running into limitations. And as I was researching ways to optimize my queries, well, at that point you couldn't even get a query plan from Athena.
But my research took me to Starburst, and that's basically how I ended up at Starburst. At Starburst, I've had kind of a nonlinear trajectory. I started as a software engineer, then I took on an engineering management job for about four years. And now I'm back as an IC working on the AI initiative.
[00:04:52] Tobias Macey:
And for people who want to dig deeper into Starburst and Trino and some of the history there, I'll add links in the show notes to some previous episodes I've done with other folks from the company. But for the purpose of today, given the topic of AI on the lakehouse and some of the different workloads, I'm wondering if you can start by giving a bit of an overview of some of the ways that the lakehouse architecture and platform intersect with the requirements and use cases around AI workloads.
[00:05:25] Alex Albu:
Yeah. So we started the AI initiative with two goals in mind. One was to make Starburst better for our users by using AI, and the other was to help our users build their own AI applications. So what does making Starburst better for users mean? That covers a wide range of things, but one of the things we've done is build an AI agent that allows users to explore their data using a conversational interface. The central point of that is data products, which are curated datasets where users can add descriptions and make them discoverable.
We wanted to take that a step further, so we've added a workflow that allows users to use AI to generate some of this metadata to enrich these data products. Users can then review the new metadata that was created, and they have the possibility to correct or add to what the machine did. What this gives users is not just better documentation to help them understand and discover their data; it also enables the agent that we created to answer questions and gain deeper insights into the data instead of just looking at schemas. We do other things with AI to make our users' lives easier. For example, we've had auto classification and tagging of data for a while, and we're using LLMs to mark, for example, PII columns.
We're also looking at using AI for things that are more behind the scenes, like workload optimizations: analyzing query plans and using that as input for our work on the optimizer, and producing recommendations for users to write their queries in a more efficient way. And then the other direction that we've been taking with AI is helping our users employ AI in their own applications. What we did for that was build a set of SQL functions that users can invoke from their queries and that give them access to models that they are able to configure in Starburst.
For listeners who are not very familiar with Starburst, I'll say that one of our tenets is optionality. We allow our users to bring their own backends to query, their own storage, their own access control systems. We provide flexibility at every step of the way, and AI models are no different. We allow users to configure whatever models they want to use, from a set of supported models, obviously. But the key is that they can use a service like Bedrock, or they can run an LLM on prem in an air gapped environment, and we support those scenarios. Essentially, what these functions allow you to do is implement the RAG workflow in a query. You can generate embeddings, you can do a vector search, and you can feed the results as context to an LLM call and get your result back. I think it's the easiest way to get started with LLMs. You don't need to know anything about APIs.
You don't need to know Python. Nothing.
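To make that concrete, here is a minimal sketch of what that RAG-in-a-query pattern can look like when embedding generation, vector search, and prompting are all exposed as SQL functions. The function names (ai_embed, ai_prompt), the documents table, and the cosine_similarity helper are illustrative assumptions rather than the confirmed Starburst API; the shape of the statement is the point.

    -- Hypothetical function names; actual Starburst signatures may differ.
    -- Embed the question, retrieve the closest documents, and pass them
    -- as context to an LLM, all in one statement.
    WITH question AS (
        SELECT ai_embed('embedding-model', 'How do I cancel my subscription?') AS qvec
    ),
    top_docs AS (
        SELECT d.body
        FROM documents d
        CROSS JOIN question q
        ORDER BY cosine_similarity(d.embedding, q.qvec) DESC
        LIMIT 5
    )
    SELECT ai_prompt(
               'llm-model',
               'Answer using only this context: ' || array_join(array_agg(body), chr(10))
           ) AS answer
    FROM top_docs;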
[00:09:59] Tobias Macey:
And another interesting aspect to the overall situation that we're in right now with LLMs and AI being the predominant area of focus for a large number of people is that there are a number of different contexts, particularly speaking as data practitioners, where we want to and need to interact with these models, where you need to be able to use them to help accelerate your own work as a data practitioner in terms of being able to generate code, generate schema diagrams, generate SQL, but you also need to provide datasets that can be consumed by these LLMs for maybe more business focused or end user focused applications.
And I'm wondering, particularly for some of those end user facing use cases, what you see as some of the existing limitations of warehouse and lakehouse architectures?
[00:11:07] Alex Albu:
I'd say it starts with the typical datasets that users will want to feed to LLMs. A lot of the work that users do with AI models is on unstructured data, like Excel spreadsheets or video and image data, and that typically doesn't fit well into a warehouse. There are other areas where things may not be ideal. For example, you were mentioning query generation. You need good quality metadata to be able to generate accurate queries, and a lot of times just reading the schema of a table or a set of tables is going to be insufficient.
So there are some limitations around the metadata that a typical warehouse or lakehouse will be able to expose. We say here at Starburst that your AI is gonna be only as good as your data is, but maybe it's even more true that your AI is gonna be only as good as your metadata is at the end of the day. And then there are other aspects. For example, consider training a model and providing it a training set. Here again, warehouses are not going to be ideal, in the sense that the data access patterns you need when you train a model, where you need to sample specific datasets, are typical for ML workloads, while warehouses are optimized for aggregating data and basically doing huge scans.
[00:13:05] Tobias Macey:
And then on the other side, for data practitioners who want to be able to use LLMs in the process of either processing data, iterating on table layouts, or generating useful SQL queries for data exploration, what are some of the current points of friction that you're seeing people run into?
[00:13:31] Alex Albu:
I think one classic one is around data privacy and regulatory compliance. That's going to be challenging, especially in multitenant lakehouses. I can tell you from experience that many of our customers have pretty strict rules about what data can be sent to an LLM, and they're even stricter than the rules on what specific users can access. So it's possible that a user is allowed to access, say, a column, but they don't want that column to be sent to an LLM. That's where I think a lot of these friction points are. LLMs can also struggle with large schemas when it comes to query generation, and large tables and complex lineage are also problems for them.
[00:14:31] Tobias Macey:
And then as far as the interaction of being able to feed things like the schemas, the table lineage, the query access patterns into an LLM, generally, that would be done either by doing an extract of that information and then passing it along or using something like an MCP server or some other form of tool use to be able to instruct the LLM how to retrieve that information for the cases where it needs it. And that's generally more of a bolt-on use case, whereas what it seems like you're doing right now with the Starburst platform is actually trying to bring the LLM into the context of the actual execution environment. And I'm wondering how you're seeing that change the ways that people think about using LLMs to interact with their lakehouse data, and some of the new capabilities or efficiencies that you're able to gain as a result.
[00:15:36] Alex Albu:
Yeah, I think that's a very pertinent observation. I'll say that the advent of MCP is great. It opens up a lot of data sources to LLMs. It's similar to how Trino has all these adapters for other data sources and opens up access to lots of data sources. But if you think about it, MCP defines a protocol for communicating with a data source, but it doesn't really say anything about the data that's going to be exposed, or the tools. Those are all left to the implementers' latitude.
And so I think the usefulness of MCP is going to depend on the quality of the servers that are going to be out there. I do think it's going to be a very useful tool. But, like any tool, it has to be used for the right use cases. So for example, you might have some data sitting in a Postgres database and some data sitting in an Iceberg table, and you might have MCP servers that provide you access to both of those. And you're going to ask your LLM, or your agent, to provide a summary of data that requires joining the two datasets.
I suppose it may be able to pull data from the two sources and join it, but that doesn't seem right. What we are proposing is to go the other direction. Using Starburst, you have access to both. You can federate the data, you can join it, and then you can gather the data and pass it through a SQL function to the LLM and have it summarize it or whatnot. So I do see what we're building as complementary to MCP, if you want. We're definitely also considering exposing an MCP server ourselves. But, again, use the right tool for the right job.
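As a sketch of that federation-first pattern: the join happens in the engine, and only the joined result is handed to the model. The catalog and table names and the ai_summarize task function are assumptions for illustration, not confirmed API.

    -- postgres and lake are two catalogs configured in the same engine;
    -- ai_summarize is a hypothetical task function taking a model name.
    SELECT ai_summarize(
               'llm-model',
               array_join(array_agg(o.status || ': ' || t.notes), chr(10))
           ) AS summary
    FROM postgres.crm.orders o
    JOIN lake.sales.tickets t          -- Iceberg table
      ON o.order_id = t.order_id
    WHERE o.created_at >= date '2025-01-01';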
[00:18:08] Tobias Macey:
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
And digging into the implementation of what you're building at Starburst to bring more of that AI oriented workflow into the context of the actual query engine, this federated data access layer, I'm wondering if you can give an overview of some of the features you're incorporating, the capabilities you're adding, and some of the foundational architectural changes that you've had to make, both to the Trino query engine and the Iceberg table format, to enable those capabilities?
[00:19:15] Alex Albu:
So there are a few things. I did mention SQL functions. We have roughly three sets of SQL functions that we provide. There are a few task-specific SQL functions that basically use predetermined prompts to perform things like summarization, classification, sentiment analysis, things like that. We also provide a more open ended prompt function that you can use to experiment with different prompts. We think they may not be opened up to the same groups of users. The prompt function may be used by data scientists or somebody who's more of a prompt engineer, while the task-specific ones don't really require much background in LLMs, so they can be used by a wider group of users. And then the third category is functions that allow you to use LLMs to generate embeddings for RAG use cases. One thing I've mentioned before is that we allow users to configure their own models.
And it's worth mentioning that you can configure multiple models. We offer quite a few knobs when you configure a model, not just temperature and top-p and other parameters; we also allow users to customize prompts per model, because one thing we've learned is that there's a pretty wide behavior gap between models. The other thing we offer for models is governance, and we offer governance at several levels. One is that you can obviously control who can access specific functions.
But then we take it a step further. These functions actually take as one of their arguments the specific model they are supposed to operate with, and so we offer the possibility for an admin to control access to specific models. It does make sense to restrict access to models that are very expensive. So that gives admins and data stewards quite a few levers in configuring that. Being able to provide governance at the model level has required a bit of a novel approach in the way we have tackled governance and access control in general.
Some other things we're building here are around model usage observability. All users are very interested in being able to set usage limits, budgets, even controlling the bandwidth that a specific user might use up in terms of, say, tokens per minute. And I did mention the conversational data interface that we've created. We think the differentiator there was building it around data products, which act as a semantic layer and allow users to provide insights into the data that are actually difficult to glean by an LLM or even a human. As far as architectural changes, I did allude a bit to governance and access control.
The other area where we've encountered technical challenges is in sending higher volumes of data to LLMs. LLMs are fairly slow to respond, so they don't fit very well with processing large datasets, and we're looking into ways to process large amounts of data cost efficiently. LLM providers do offer batch interfaces for such use cases. However, the challenge there is integrating that with a SQL interface. These batch APIs are typically async, so they're not going to work well with a SELECT statement. We are considering a slightly different paradigm there. Another thing we've built, or are currently in the process of building, is auditing.
That's another big component. Some of our users actually require capturing a fair amount of audit data for their LLMs. So that's again another challenge.
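A rough sketch of how those function families and the per-model argument could compose; the names here are placeholders rather than the shipped API, and access to the two models in the example would be gated by the governance layer described above.

    -- Task-specific functions with predetermined prompts:
    SELECT review_id,
           ai_classify('cheap-model', review_text,
                       ARRAY['billing', 'shipping', 'quality']) AS topic,
           ai_analyze_sentiment('cheap-model', review_text) AS sentiment
    FROM reviews;

    -- Open-ended prompt function, typically restricted to fewer users
    -- (e.g. prompt engineers) and to more expensive models:
    SELECT ai_prompt('expensive-model',
                     'List the product names mentioned in: ' || review_text)
    FROM reviews
    LIMIT 10;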
[00:24:56] Tobias Macey:
And then on the storage layer, a lot of people who are using Trino and Starburst are storing their data in Iceberg tables on object storage, and Iceberg as a format generally defers to Parquet or ORC as the storage layer. So there are a lot of pieces of coordination to be able to make any meaningful changes to the way they behave. I'm wondering, what are some of the ways that you have addressed the complexity of storing and accessing these vector embeddings in lakehouse contexts, and some of the reasons for sticking with Iceberg versus looking to other formats like Lance?
[00:25:42] Alex Albu:
So that's actually a very interesting question. It turns out that there are already discussions in the Iceberg community about supporting Lance as a file format, and we are looking into that. We're going to be working with the Iceberg community, but definitely using an alternate file format like Lance is on the table. It's an option that we are evaluating. I'll also say that for smaller datasets, it's possible to store data in a different data source like pgvector, for example. We do offer support for pgvector, so it's possible to use that as a vector database.
But there is a lot of interest in storing data in Iceberg. And the right data format, and the right shape of the indexes that are going to be required for efficiently doing vector searches and semantic searches, that's very much an area that's under active investigation and development.
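For reference, the pgvector option looks like this in plain Postgres terms; how Starburst surfaces it through its connector isn't spelled out in the conversation, so treat the integration wiring as an assumption.

    -- Plain pgvector: store embeddings and run a nearest-neighbor search.
    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE docs (
        id        bigint PRIMARY KEY,
        body      text,
        embedding vector(3)  -- real embedding models use hundreds to thousands of dimensions
    );
    CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);

    -- <=> is cosine distance; lower is closer.
    SELECT id, body
    FROM docs
    ORDER BY embedding <=> '[0.01, -0.2, 0.13]'::vector
    LIMIT 5;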
[00:27:02] Tobias Macey:
And so for teams who are adopting these new capabilities as you roll them out, what are some of the ways that you're seeing them either add new workloads to what they've been using Starburst for or some of the ways that it's changing the ways that they use Starburst for the work that they were already doing?
[00:27:23] Alex Albu:
Yeah. Like I mentioned before, the capabilities that we've added open up the possibility of essentially building and running an entire RAG workflow in SQL. You can generate embeddings for data you have, and we do have bulk APIs for doing that. You can perform a semantic search, and based on your top hits you can build your context and pass that to an LLM, without using a framework like LangChain, for example. I think this opens up new possibilities for analysts who otherwise would probably not have gotten close to these capabilities.
You can imagine, for example, dashboards built on queries that employ these SQL functions.
[00:28:29] Tobias Macey:
And then in terms of the education around these capabilities, everybody is at a different stage of their journey of overall adoption of generative AI and LLMs and bringing them into the work of building their data systems. I'm wondering how you're approaching the overall messaging around the capabilities, rolling out the features, and some of the validation that you're doing as you bring these capabilities to your broader audience and loop that feedback into your successive iterations of product development?
[00:29:08] Alex Albu:
Yeah. I think the way we ran this project was to get customers in the loop quite early by doing demos of, essentially, unreleased software. Our product team would demo early builds to get feedback and validate that we're on the right track and building useful stuff for our users. I do think that in this area, the best way to document and make customers aware of the value that we're providing is by showing them small applications that you can build using these capabilities.
So, for example, you could envision a database storing restaurant reviews. You could show how you can do a sentiment analysis on that and then render the daily reviews in a dashboard as red, yellow, and green bar charts. And what's cool about it is showing people how they can actually compose these functions. For example, if you had restaurant reviews in different languages, you could translate them all to English before doing sentiment analysis, or summarize the general complaints that people might have, things like that. So I think in general it's important to come up with some good examples that can highlight the new capabilities that this truly opens up.
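That composition reads naturally in SQL. A hedged sketch, reusing the hypothetical task functions from earlier (Trino's open source AI functions follow a similar pattern, though the exact Starburst signatures may differ):

    -- Translate each review to English, score sentiment, and group per day
    -- for a red/yellow/green dashboard.
    SELECT date(created_at) AS review_day,
           ai_analyze_sentiment('llm-model',
               ai_translate('llm-model', review_text, 'en')) AS sentiment,
           count(*) AS review_count
    FROM restaurant_reviews
    GROUP BY 1, 2
    ORDER BY 1;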
[00:31:07] Tobias Macey:
In the work that you did to bring these AI focused capabilities into Starburst and Trino and tack them onto the Iceberg capabilities, what are some of the interesting engineering challenges that you had to address, and what was some of the preexisting functionality that helped you on that path?
[00:31:30] Alex Albu:
Yeah. I think the main preexisting functionality is the capability to access a wide variety of data sources, because the name of the game here is getting your AI access to the data. In that sense, Starburst is uniquely positioned to be plugged into all of our users' data. And so making that available to LLMs is challenging and is an ongoing effort. I did mention how processing large amounts of data is challenging from a technical perspective because it doesn't fit well with the SQL paradigm.
But we do have some innovative ways in which we are going to allow that sort of processing to be embedded in, say, a workflow that our users might have. A few things that we've learned along the way are, essentially, that working with LLMs is a paradigm shift. We did learn that LLMs can be fickle, and writing tests is a real challenge. Being able to write tests is very challenging when the system that you're testing is probabilistic rather than deterministic.
So you need to get, to some extent, into the mindset of a data scientist and embrace experimentation. Everything here is very data driven, so generating meaningful datasets is critical for building such a product, and we're looking at various approaches for getting datasets that we can thoroughly test our models and our functionality on.
[00:33:56] Tobias Macey:
And as you have been releasing these capabilities, onboarding some of your early adopters, what are some of the most interesting or innovative or unexpected ways that you've seen these AI focused capabilities applied?
[00:34:10] Alex Albu:
We're obviously at the beginning of this journey, like the whole industry. But we do have a few customers who are more advanced, and we've seen them do some interesting things. For example, when exploring different datasets coming from, say, different providers they might be ingesting data from, they'll need to join these datasets, but they won't necessarily have similarly named columns. Inferring, say, the join column is not always easy based on just a column name match. But with AI, you can actually do some semantic analysis and find the matching columns that way, and essentially have the machine figure out the join in ways that were not possible before. Another interesting thing we've seen our customers do, and this is actually something that we're also looking into for internal use, is generating synthetic datasets, which removes the danger of using PII data in testing and things like that.
So using LLMs to generate synthetic datasets is another interesting use case that we've seen.
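The semantic join-key matching can be sketched with the same hypothetical prompt function, handing the model only the schemas rather than the data:

    -- Ask the model to propose a join key from two column lists
    -- (ai_prompt is a hypothetical open-ended prompt function).
    SELECT ai_prompt(
        'llm-model',
        'Table A has columns (cust_ref, total, ts). ' ||
        'Table B has columns (customer_id, order_value, created_at). ' ||
        'Which pair of columns most likely joins the tables? Answer as A.col = B.col.'
    ) AS suggested_join;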
[00:35:38] Tobias Macey:
For people who are interested in being able to use AI in the context of their data systems, what are the situations where either Starburst specifically or the lakehouse architecture generally are the wrong choice?
[00:35:57] Alex Albu:
So I think at this point, I wouldn't recommend using Starburst for a use case that uses data like video or large blobs of unstructured data, or sensor data, or something like that. We're not ready at this point to deal with those types of data. Also, high volume, high concurrency operations are not going to fit well here, and a lot of that is actually due to the performance of LLMs and the lack of support for such operations.
But, again, as with everything, choose the right tool for the right task. While I think Starburst can handle a lot of use cases, it's definitely not going to handle all of them.
[00:36:54] Tobias Macey:
And as you continue building out these AI focused capabilities, the landscape around you continues to evolve at a blistering pace. What are some of the things you have planned for the near to medium term, or any particular projects or problem areas you're excited to explore?
[00:37:13] Alex Albu:
Yeah, you're right, this is moving at a fast pace. We do have a lot of plans. We plan to add MCP support, and we're thinking about how to make it the most useful, working with our customers. You'll notice that a lot of databases out there just expose a simple API to run a query in their native query language. I think we can do more than that and allow agents to use an MCP server to automate a lot of the tasks users might want to do in, say, Galaxy, like spinning up clusters or setting up various resources. I think there are lots of opportunities there for integrating with agents that our users have already built, actually. You were asking about the work for storing vectors in Iceberg.
That's definitely an area that we're building in, essentially making Starburst a more performant vector database. That includes looking at new file types for storing the data and looking at various vector indexes, as well as indexes that support full text search. We were also talking about some of the weaknesses that I've mentioned that are common for warehouses, and Starburst is no exception. We are going to be working in that area to provide better support for data types that are currently not as well supported, like PDF files, Excel spreadsheets, and other more unstructured data, potentially images and video.
So we're looking at extending Icehouse, our managed platform for data ingestion and transformation, to be able to ingest various types of data, generate embeddings for it, and potentially apply AI transformations. Again, there are lots of possibilities that we see there. And we do want to continue extending the use of AI features throughout the product for the things I've mentioned, which would allow us to make the product more efficient and provide recommendations to our users for improving their queries and the way they use the data. And then finally, we're working on extending the agent that we built to be able to generate graphical visualizations and data explorations. I personally think this is the way BI tools are headed: allowing users to do ad hoc explorations just using natural language and visualizing the results of their questions.
[00:40:50] Tobias Macey:
Are there any other aspects of the work that you're doing, the AI focused capabilities that you're adding into Starburst or just the overall space of building for and with AI as data practitioners that we didn't discuss yet that you would like to cover before we close out the show?
[00:41:08] Alex Albu:
No, I think we're good. We covered a fair amount of topics here. It's definitely a very exciting area that is going to be a game changer for the way we interact with data and gain insights from it.
[00:41:33] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and the rest of the Starburst team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:41:53] Alex Albu:
Oh, tough question. I think if I were to choose something, maybe it's a lack of intelligent data observability and contextual understanding of your data. Current tools are good at, say, syntax validation and basic data profiling, but they still struggle with semantic understanding of the relationships between data. And, incidentally, I think this is an area where AI is going to be able to help and provide insights that were not achievable before.
[00:42:45] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you and the rest of the Starburst folks are doing on bringing AI closer into the lakehouse and the ways that we can use these models to accelerate our own work as data practitioners, working with these large and complex data estates that we're responsible for. I appreciate all the time and energy that you're putting into that, and I hope you enjoy the rest of your day.
[00:43:14] Alex Albu:
Thank you. I really appreciate the opportunity to talk to you, and it was a great conversation. Thanks.
[00:43:29] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Poor quality data keeps you from building best in class AI solutions. It costs you money and wastes precious engineering hours. There is a better way. Core signal's multi source enriched cleaned data will save you time and money. It covers millions of companies, employees, and job postings and can be accessed via API or as flat files. Over 700 companies work with Core Signal to develop AI solutions in investment, sales, recruitment, and other industries. Go to dataengineeringpodcast.com/coresignal and try Core Signal's self-service platform for free today.
This is a pharmaceutical ad for SOTA data quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of undiagnosed data quality syndrome, also known as UDQS. Ask your data team about SOTA. With SOTA metrics observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks and the fastest in the category, analyzing 1,100,000,000 rows in just sixty four seconds. And with collaborative data contracts, engineers and business can finally agree on what done looks like, so you can stop fighting over column names and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you.
Side effects of implementing soda may include increased trust in your metrics, reduced late night Slack emergencies, spontaneous high fives across departments, fewer meetings, and less back and with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a 1,000 plus dollar custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow SOTA's launch week, which starts on June 9. Your host is Tobias Macy, and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads. So, So, Alex, can you start by introducing yourself?
[00:02:11] Alex Albu:
My name is Alex Albu. I I've been, with Starburst, for about, six years now, and I currently am, the tech lead for our, AI initiative.
[00:02:23] Tobias Macey:
And do you remember how you got started working in data?
[00:02:27] Alex Albu:
Yeah. I I come from a software engineering background. But, a few jobs ago, I was working for, IDEX, working on their veterinary, practice, software. And, we we had to build a few ETL pipelines, pulling in data from various practices into into IDEX. And I I think maybe the the point where I really got into data engineering was, when I, was working on, I on rebuilding an ETL system based on Hadoop. And I I I replaced that with a spark based system. And the the results were actually pretty spectacular. The I think, performance gains were let us go from, running a five node cluster, like, twenty four seven to, like, a smaller three node cluster that was just running a a few hours a day. So so that got me into into big data.
And, when I moved on to my next job at, at a company called TraceLink, I, I built there an, an analytics platform, using, using Spark for detail and Redshift for querying data. And we started, running into limitations of Redshift at that point. Started looking at, other solutions for analytics, and we came across Athena for, for querying data that, we are dumping into a data lake. I I thought this was a great solution, until, again, we we started using it for more serious, use cases, and, I started running implementations. And as I was, researching, you know, ways to optimize my queries to at at that point, you couldn't even get a query plan from, from Athena.
But, my research took me to, to Starburst. And and that's that's basically how I ended up at, Starburst. And at Starburst, I've been, you know, I had kind of a nonlinear trajectory. I started as a software engineer, Then I I took on, an engineering management job for a few years, for about four years. And now I'm back, as an IC working on, on the AI initiative.
[00:04:52] Tobias Macey:
And for people who want to dig deeper into Starburst, Intrino, and some of the history there, I'll add links in the show notes to some previous episodes I've done with other folks from the company. But for the purpose of today, given the topic of AI on the lake house and some of the different workloads, I'm wondering if you can just start by giving a bit of an overview of some of the ways that the lake house architecture, the Lakehouse platform intersects with some of the requirements and some of the use cases around AI workloads.
[00:05:25] Alex Albu:
Yeah. So as part of the AI initiative, you know, we like, we started this with two goals in mind. One was to make Starburst better for for our users by using AI, and the other one was to help our users build their own AI applications. And so what what does making Starburst better for users? You know, that covers a a wide range of of things. But, one of the things, we've done was, build an AI agent that allows users to explore their data using a conversational interface. And the sort of the the central point of that is, is is data products, which are curated datasets where users can can add, descriptions and, make them discoverable.
And and so we we wanted to take that a step further, and we've added a workflow that, allows users to to use AI to, generate some of this metadata to enrich these data products. And then users can can review these these, changes, these, this new metadata that was created. And, and they they have they have the possibility to, to correct or or add to, what the machine did. So then what this gives gives users is not just better better better documentation and, you know, help have some have some understand and discover their data, but it also enables an agent the agent that we created to to be able to answer questions and and gain, deeper insights, into the data instead of just just letting it look at at schemas. Right? We do we do other things with, with AI to make, make our users' lives easier. Like, for example, we we've we've had for a while, auto classification and tagging of of data. Right? We're using LLMs to mark, for example, PII, columns.
And, we're we're also looking at using AI for for things that are sort of more behind the scenes, like workload optimizations and and, you know, like, analyzing, query plans and decide you know, using that as as input for our work on, on the optimizer, producing, recommendations for, for users, for to make their to write their queries in a more efficient way. So and then the other direction that that we've been taking with AI was helping our users, employ AI in their own applications. And so what we did for that was, we built a set of SQL functions, right, that that users can invoke from their queries and that give them access to to models that they are, able to configure in Starburst.
So as with, you know, for for users who for for listeners who are not, very familiar with Starburst, I'll say that one of our tenets is is, optionality. So we allow our, users to bring their own back ends to query, write their own storage, their own, access control systems. So we we pro we provide flexibility at every step of the way. And AI models are no different. We allow users to configure, whatever elements they they want to use. Well, you know, from a from a from a set of supported supported models, obviously. But the the key is that they can they can use, you know, a service like, Bedrock, or they can run an LLM on prem in an air gapped environment. And we we support, those, those, scenarios. And and, essentially, what these functions allow you to do is express basically implement the rack workflow in a in a query. Like like, truly, like, you can you can generate embeddings, you can do a vector search, and you can feed the results as, as context to an LLM call and get your result back. I think it's the easiest way to to basically get started with with Nela. You don't need to know anything about APIs.
Don't need to know Python. Nothing.
[00:09:59] Tobias Macey:
And another interesting aspect to the overall situation that we're in right now with LLMs and AI being the predominant area of focus for a large number of people is that there are a number of different contexts, particularly speaking as data practitioners, where we want to and need to interact with these models, where you need to be able to use them to help accelerate your own work as a data practitioner in terms of being able to generate code, generate schema diagrams, generate SQL, but you also need to provide datasets that can be consumed by these LLMs for maybe more business focused or end user focused applications.
And I'm wondering, particularly for some of those end user facing use cases, what you see as some of the existing limitations of warehouse and lakehouse architectures
[00:11:07] Alex Albu:
typical datasets that, that LLMs that users will want to feed to LLMs. So so for example, you know, a lot of work that users do with AI models is is on unstructured data. You know, like, it's cell spreadsheets or, video image data, stuff like that. And that's that doesn't fit well in into a into a warehouse typically. Right? Other other there are other other areas where, things may not be ideal. So for example, you you were mentioning, you know, like, query generation, things like that. You you need good quality metadata in order to, to be able to generate, accurate queries. Right? And a lot of times, just just the schema reading the schema of, of of a table or of a set of tables is is going to be insufficient.
So there are some some limitations around the metadata that, typical warehouse or lakehouse will, will be able to expose. You know, we we say, here at Starburst that your AI is gonna be only as good as your data is, but maybe it's even more true that your AI is gonna be only as good as your metadata is at the end of the day. And then there are there are other aspects. So so for example, if you consider, training a model, providing it a training set. Right? So I I I think that here again, like, maybe warehouses are not going to be ideal in the sense that the the data access patterns that that you need when you train a model, like, where where you need to sample data, you know, specific datasets, that that that would be, an an access pattern that that would be typical for an NLM, while warehouses are optimized typically, right, for for aggregating data and, basically doing huge scans.
[00:13:05] Tobias Macey:
And then on the other side, for data practitioners who want to be able to use LLMs in the process of either processing data or iterating on table layouts or being able to generate useful SQL queries for doing data exploration? What are some of the current points of friction that you're seeing people run into?
[00:13:31] Alex Albu:
I think one classic one is, I think, around, data privacy and regulatory compliance. That's that's going to be challenging, especially in multitenant lake houses. There are there are I can tell you from experience that, many of our customers are have have pretty strict rules about what data can be sent to an LNM, and they're even stricter than, essentially the the rules on what, specific users can access. Right? So, like, it's possible that that the user is allowed to access, say, a column, but they don't want that to be sent to to an LLM. So that's, I that's where I think, a lot of the these these friction points are. Adelands can also struggle with with large schemas when when it comes to to query generation. And also, if large tables, the complex lineage, those are all problems for them.
[00:14:31] Tobias Macey:
And then as far as the interaction of being able to feed things like the schemas, the table lineage, the query access patterns into an LLM, generally, that would be done either by doing an extract of that information and then passing it along or using something like an MCP server or some other form of tool use to be able to instruct the LLM how to retrieve that information for the cases where it needs it. And that's generally more of a bolt on use case, whereas what it seems like you're doing right now with the Starburst platform is actually trying to bring the LLM into the context of the actual execution environment. And I'm wondering what are some of the ways that you're seeing that change the ways that people think about using LLMs to interact with their lakehouse data and some of the new capabilities or some of the efficiencies that you're able to gain as a result.
[00:15:36] Alex Albu:
Yeah. I think that's that's a that's a very pertinent con observation. So, I mean, I'll I'll say that, you know, the advent of MCP is is, is great. I think, you know, like, it it opens up a lot of data to, data sources to LLMs. You know, it's similar to how, Trino, you know, opens up like, has has all these adapters for other data sources, right, and and opens opens up access to access to to lots of data sources. But if you if you think about it, MCP is is, is defines a protocol for communicating with with a data source, but it doesn't really say anything about the, the data that's going to be exposed, right, or or the tools. Like, those are all left at the implementers' latitude.
And so I think, you know, the usefulness of MCP is going to depend on the quality of the the servers that are going to be out there. I I do I do I do think it's, it's it's going to be a very useful tool. But, like, with any tool, you have to you have to it has to be used for the right case, use cases. Right? So for example, you might have some data sitting in a Postgres database and some data sitting in an iceberg table. And you might have MCP servers that that provide you access, to both of those. And, you know, you're you're going to ask your LLM to or your agent to provide a summary of data, you know, like, that that requires joining the two datasets. Right?
I suppose it may be able to pull data from the two sources and join it, but that doesn't seem right. What we are proposing is to go the other direction. Using Starburst, you have access to both. You can federate the data, you can join it, and then you can gather the result and pass it through a SQL function to the LLM and have it summarize it or whatnot. So I do see what we're building as complementary to MCP, if you want. We're definitely also considering exposing an MCP server ourselves. But, again, use the right tool for the right job.
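To make the pattern Alex describes concrete, here is a minimal sketch of federating a Postgres catalog with an Iceberg catalog and passing the joined result through an LLM-backed SQL function. The catalog, schema, and function names (including ai_summarize) are illustrative assumptions, not confirmed Starburst APIs:

```sql
-- Join operational data in Postgres with lakehouse data in Iceberg,
-- then hand the combined text to a (hypothetical) summarization function.
SELECT ai_summarize(feedback_digest) AS summary   -- ai_summarize is assumed
FROM (
    SELECT array_join(array_agg(c.segment || ': ' || f.feedback_text), chr(10)) AS feedback_digest
    FROM postgresql.crm.customers AS c
    JOIN iceberg.events.feedback AS f
      ON c.customer_id = f.customer_id
);
```

The point is that the join happens inside the engine, so the LLM only ever sees the already-federated result rather than pulling from two MCP servers and joining client-side.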
[00:18:08] Tobias Macey:
And digging into the implementation of what you're building at Starburst to bring more of that AI oriented workflow into the context of the actual query engine, this federated data access layer, I'm wondering if you can give an overview of some of the features that you're incorporating, the capabilities that you're adding, and some of the foundational architectural changes that you've had to make, both to the Trino query engine and the Iceberg table format, to enable those capabilities?
[00:19:15] Alex Albu:
So there are a few things. I did mention SQL functions. We have, broadly, three sets of SQL functions that we provide. There are a few task-specific SQL functions that use predetermined prompts to perform things like summarization, classification, sentiment analysis, things like that. We also provide a more open-ended prompt function that you can use to experiment with different prompts. We think they may not be opened up to the same groups of users: the prompt function may be used by, say, data scientists or somebody who's more of a prompt engineer, while the task-specific ones don't really require much background in LLMs, so they can be used by a wider group of users. And then the third category is functions that allow you to use LLMs to generate embeddings for RAG use cases. One thing I think I've mentioned before is that we allow users to configure their own models.
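As a rough illustration of the three function families Alex outlines, here is a sketch with assumed names and signatures; the real Starburst functions may be named and shaped differently:

```sql
-- Task-specific: a predetermined prompt under the hood, no LLM background needed.
SELECT ticket_id,
       ai_classify(body, ARRAY['billing', 'outage', 'feature request']) AS category
FROM iceberg.support.tickets;

-- Open-ended: a free-form prompt, aimed at prompt-engineer-type users.
SELECT ai_prompt('Rewrite for a non-technical audience: ' || release_notes) AS blurb
FROM iceberg.eng.releases;

-- Embeddings: vector output for RAG use cases.
SELECT chunk_id, ai_embed(chunk_text) AS embedding
FROM iceberg.docs.chunks;
```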
And it's worth mentioning that you can configure multiple models. We offer quite a few knobs when you configure a model, not just temperature and top-p and other parameters; we also allow users to customize prompts per model, because one thing we've learned is that there's a pretty wide behavior gap between them. The other thing that we offer for models is governance, and we offer it at several levels. One is that you can obviously control who can access specific functions.
But then we take it a step further. These functions actually take as one of their arguments the specific model that they are supposed to operate with, and so we offer the possibility for an admin to control access to specific models. It does make sense to restrict access to models that are very expensive, for example. So that gives admins and data stewards quite a few levers in configuring that. Being able to provide governance at the model level has required a bit of a novel approach in the way we have tackled governance and access control in general.
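The episode doesn't spell out the governance syntax, but the shape might look something like the following sketch, where both the GRANT-on-model statements and the model argument are assumptions rather than documented features:

```sql
-- Hypothetical: restrict an expensive model to a narrow role,
-- while a cheaper model stays broadly available.
GRANT EXECUTE ON MODEL premium_model TO ROLE prompt_engineers;
GRANT EXECUTE ON MODEL economy_model TO ROLE analysts;

-- Because the model is a function argument, the engine can enforce
-- the grant on every call.
SELECT ai_summarize(report_text, 'economy_model')
FROM iceberg.finance.reports;
```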
Some other things that we're building here are around model usage observability. All users are very interested in being able to set usage limits, budgets, even controlling the bandwidth that a specific user might use up in terms of, say, tokens per minute. And I did mention the conversational data interface that we've created. We do think the differentiator there was building it around data products, which act as a semantic layer and allow users to provide insights into the data that are actually difficult for an LLM, or even a human, to glean. As far as architectural changes, I did allude a bit to governance and access control.
The other area where we encountered technical challenges was in sending higher volumes of data to LLMs. LLMs are fairly slow to respond, so they don't fit very well with processing large datasets, and we're looking into ways to process large amounts of data cost-efficiently. LLM providers do offer batch interfaces for such use cases; however, the challenge there is integrating that with a SQL interface. These batch APIs are typically async, so they're not going to work well with a select statement. We are considering a slightly different paradigm there. Another thing that we've built, or are currently in the process of building, is auditing.
That's another big component. Some of our users actually require capturing a fair amount of audit data for their LLM usage. So that's again another challenge.
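One way the async batch paradigm could surface in SQL, purely as a sketch, is a submit-and-poll flow rather than a blocking SELECT. The ai.submit_batch procedure here is invented for illustration; the episode only says that batch APIs don't fit a select statement and that a different paradigm is under consideration:

```sql
-- Hypothetical procedure: submit a job to the LLM provider's batch API,
-- with results landing in a table instead of a synchronous result set.
CALL ai.submit_batch(
    input_sql    => 'SELECT review_id, review_text FROM iceberg.reviews.raw',
    prompt       => 'Summarize this review in one sentence',
    model        => 'economy_model',
    output_table => 'iceberg.reviews.summaries'
);

-- Poll later, once the async batch has completed.
SELECT review_id, summary
FROM iceberg.reviews.summaries;
```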
[00:24:56] Tobias Macey:
And then on the storage layer, a lot of people who are using Trino and Starburst are storing their data in Iceberg tables on object storage, and Iceberg as a format generally defers to Parquet or ORC as the storage layer. So there are a lot of pieces of coordination to be able to make any meaningful changes to the way that they behave. And I'm wondering what are some of the ways that you have addressed the complexity of storing and accessing these vector embeddings in lakehouse contexts, and some of the reasons for sticking with Iceberg for that versus looking to other formats like Lance?
[00:25:42] Alex Albu:
So that's actually a very interesting question. It turns out that there are already discussions in the Iceberg community about supporting Lance as a file format, and we are looking into that. We're going to be working with the Iceberg community, but definitely using an alternate file format like Lance is on the table; it's an option that we're evaluating. I'll also say that for smaller datasets, it's possible to store data in a different data source like pgvector, for example. We do offer support for pgvector, so it's possible to use that as a vector database.
But there is a lot of interest in storing data in Iceberg. And the right data format, and the right shape of the indexes that are going to be required for efficiently doing vector searches and semantic searches, that's very much an area that's under active investigation and development.
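For the smaller-dataset path, pgvector works today in plain Postgres. A minimal example, with illustrative table and values (real embeddings have hundreds of dimensions):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_chunks (
    id        bigint PRIMARY KEY,
    chunk     text,
    embedding vector(3)   -- toy dimensionality for the example
);

INSERT INTO doc_chunks VALUES (1, 'rotate credentials monthly', '[0.1, 0.9, 0.2]');

-- <=> is pgvector's cosine-distance operator; smallest distance ranks first.
SELECT id, chunk
FROM doc_chunks
ORDER BY embedding <=> '[0.1, 0.8, 0.3]'
LIMIT 5;
```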
[00:27:02] Tobias Macey:
And so for teams who are adopting these new capabilities as you roll them out, what are some of the ways that you're seeing them either add new workloads to what they've been using Starburst for, or change how they use Starburst for the work that they were already doing?
[00:27:23] Alex Albu:
Yeah. Like I mentioned before, the capabilities that we've added open up the possibility of essentially building and running an entire RAG workflow in SQL. You can generate embeddings for data you have, and we do have bulk APIs for doing that. You can perform a semantic search, and based on your top hits, you can build your context and pass that to an LLM, without using a framework like LangChain, for example. I think this opens up new possibilities for analysts who otherwise would probably not have gotten close to these capabilities.
You can imagine dashboards built on top of queries that employ these SQL functions.
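A sketch of that RAG-in-SQL flow, with ai_embed, ai_prompt, and cosine_similarity all assumed rather than documented: embed the question, rank the stored chunks, then pass the top hits as context, all in a single statement:

```sql
WITH hits AS (
    SELECT chunk_text
    FROM iceberg.docs.chunks
    ORDER BY cosine_similarity(embedding, ai_embed('How do I rotate credentials?')) DESC
    LIMIT 5
)
SELECT ai_prompt(
           'Answer using only this context:' || chr(10)
           || array_join(array_agg(chunk_text), chr(10))
           || chr(10) || 'Question: How do I rotate credentials?'
       ) AS answer
FROM hits;
```

No orchestration framework is involved; the context assembly is just string aggregation inside the engine.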
[00:28:29] Tobias Macey:
And then in terms of the education around these capabilities, everybody is at a different stage of their journey of overall adoption of generative AI and LLMs, and of bringing them into the work of building their data systems. And I'm wondering how you're approaching the overall messaging around the capabilities, rolling out the features, and some of the validation that you're doing as you bring these capabilities to your broader audience and loop that feedback into your successive iterations of product development?
[00:29:08] Alex Albu:
Yeah. I think the way we ran this project was to get customers in the loop quite early by doing demos of, essentially, unreleased software. Our product team would demo early builds to get feedback and validate that we were on the right track and building useful stuff for our users. I do think that in this area, the best way to document and make customers aware of the value that we're providing is by showing them small applications that you can build using these capabilities.
So you could envision, say, a database storing restaurant reviews. You could show how you can do sentiment analysis on that and then render the daily reviews in a dashboard as red, yellow, and green bar charts. And what's cool about it is showing people how they can actually compose these functions. For example, if you had restaurant reviews in different languages, you could translate them all to English before doing sentiment analysis, or summarize the general complaints that people might have, things like that. So I think, in general, it's important to come up with some good examples that can highlight the new capabilities this truly opens up.
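The restaurant-review demo compresses nicely into composed function calls. Assuming hypothetical ai_translate and ai_analyze_sentiment functions and an illustrative table, the translate-then-analyze composition might look like:

```sql
-- Translation runs first so sentiment analysis always sees English text;
-- the grouped counts can feed a red/yellow/green dashboard.
SELECT date_trunc('day', created_at) AS review_day,
       ai_analyze_sentiment(ai_translate(review_text, 'en')) AS sentiment,
       count(*) AS review_count
FROM iceberg.reviews.restaurant_reviews
GROUP BY 1, 2;
```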
[00:31:07] Tobias Macey:
In the work that you did to bring these AI focused capabilities into Starburst and Trino, and tack them onto the Iceberg capabilities, what are some of the interesting engineering challenges that you had to address, and what was some of the preexisting functionality that helped you along that path?
[00:31:30] Alex Albu:
Yeah. I think the main preexisting functionality is the capability to access a wide variety of data sources, because the name of the game here is getting your AI access to the data. In that sense, Starburst is uniquely positioned to plug into all of our users' data. And so making that available to LLMs is challenging and is an ongoing effort. I did mention how processing large amounts of data is challenging from a technical perspective because it doesn't fit well with the SQL paradigm.
But we do have some innovative ways in which we are going to allow that sort of processing to be embedded in, say, a workflow that our users might have. A few things that we've learned along the way are that, essentially, working with LLMs is a paradigm shift. We did learn that LLMs can be fickle, and writing tests is a real challenge. Being able to write tests is very hard when the system you're testing is probabilistic rather than deterministic.
So you need to get, to some extent, into the mindset of a data scientist and embrace experimentation. Everything here is very data driven, so generating meaningful datasets is critical for building such a product, and we're looking at various approaches for getting datasets that we can thoroughly test our models, and our functionality, on.
[00:33:56] Tobias Macey:
And as you have been releasing these capabilities, onboarding some of your early adopters, what are some of the most interesting or innovative or unexpected ways that you've seen these AI focused capabilities applied?
[00:34:10] Alex Albu:
We're obviously at the beginning of this journey, like the whole industry. But we do have a few customers who are more advanced, and we've seen them do some interesting things. For example, when exploring different datasets coming from, say, different providers they might be ingesting data from, they'll need to join these datasets, but the datasets won't necessarily have similarly named columns. Inferring, say, the join column is not always easy based on just a column name match. But with AI you can actually do some semantic analysis, find the matching columns that way, and essentially have the machine figure out the join in ways that were not possible before. Another interesting thing that we've seen our customers do, and this is actually something that we're also looking into for internal use, is generating synthetic datasets, which removes the danger of using PII data in testing and things like that.
So using LLMs to generate synthetic datasets is another interesting use case that we've seen.
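A rough sketch of the embedding-based join inference described above: embed the column names (or sampled values) from each provider's table and rank the cross-pairs by similarity. Here ai_embed and cosine_similarity are assumed functions, and the table names are invented:

```sql
WITH a AS (
    SELECT column_name, ai_embed(column_name) AS v
    FROM iceberg.information_schema.columns
    WHERE table_name = 'provider_a_events'
),
b AS (
    SELECT column_name, ai_embed(column_name) AS v
    FROM iceberg.information_schema.columns
    WHERE table_name = 'provider_b_events'
)
SELECT a.column_name AS left_column,
       b.column_name AS right_column,
       cosine_similarity(a.v, b.v) AS score
FROM a CROSS JOIN b
ORDER BY score DESC
LIMIT 10;
```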
[00:35:38] Tobias Macey:
For people who are interested in being able to use AI in the context of their data systems, what are the situations where either Starburst specifically or the lakehouse architecture generally are the wrong choice?
[00:35:57] Alex Albu:
So I think at this point, I wouldn't recommend using Starburst for a use case that involves data like video, large blobs of unstructured data, or sensor data, or something like that. We're not ready at this point to deal with those types of data. Also, high volume, high concurrency operations are not going to fit well here. A lot of that is actually due to the performance of LLMs and the lack of support for such operations.
But, again, as with everything, choose the right tool for the right task. While I think Starburst can handle a lot of use cases, it's definitely not going to handle all of them.
[00:36:54] Tobias Macey:
And as you continue building out these AI focused capabilities, while the landscape around you continues to evolve at a blistering pace, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas you're excited to explore?
[00:37:13] Alex Albu:
Yeah, you're right, this is moving at a fast pace. We do have a lot of plans. We do plan to add MCP support, and we're thinking about how to make it the most useful, working with our customers. You'll notice that a lot of databases out there just expose a simple API to run a query in their native query language. I think we can do better. We can do more than that and allow agents to use an MCP server to automate a lot of the tasks users might want to do in, say, Galaxy, like spinning up clusters or setting up various resources. I think there are lots of opportunities there for integrating with agents that our users have already built, actually. And you were asking about the work for storing vectors in Iceberg.
That's definitely an area that we're building in, essentially making Starburst a more performant vector database. That includes looking at new file types for storing the data, and looking at vector indexes as well as indexes that support full text search. We were also talking about some of the weaknesses that I've mentioned that are common for warehouses, and Starburst is no exception. We are going to be working in that area to provide better support for data types that are currently maybe not as well supported, like PDF files, Excel spreadsheets, more unstructured data, potentially images and video.
So we're looking at extending Icehouse, our managed platform for data ingestion and transformation, to be able to ingest various types of data, generate embeddings for it, and potentially apply AI transformations. Again, there are lots of possibilities that we see there. And we do want to continue extending the use of AI features throughout the product, for things that I've mentioned, which would allow us to make the product more efficient and provide recommendations to our users for improving their queries and the way they use the data. And then finally, we're working on extending the agent that we built to be able to generate graphical visualizations and data explorations. I personally think this is the way BI tools are headed: allowing users to do ad hoc explorations just using natural language, and visualizing the results of their questions.
[00:40:50] Tobias Macey:
Are there any other aspects of the work that you're doing, the AI focused capabilities that you're adding into Starburst or just the overall space of building for and with AI as data practitioners that we didn't discuss yet that you would like to cover before we close out the show?
[00:41:08] Alex Albu:
No, I think we covered a fair amount of topics here. It's definitely a very exciting area that is going to be a game changer for the way we interact with data and gain insights from it.
[00:41:33] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and the rest of the Starburst team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:41:53] Alex Albu:
Oh, tough question. I think if I were to choose something, maybe it's a lack of intelligent data observability and contextual understanding of your data. Current tools are good at, say, syntax validation and basic data profiling, but they still struggle with semantic understanding of the relationships between data. And, incidentally, I think this is an area where AI is going to be able to help and provide insights that were not achievable before.
[00:42:45] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you and the rest of the Starburst folks are doing on bringing AI closer into the process of building data systems, and the ways that we can use these capabilities to accelerate our own work as data practitioners working with the large and complex data estates that we're responsible for. I appreciate all the time and energy that you're putting into that, and I hope you enjoy the rest of your day.
[00:43:14] Alex Albu:
Thank you. I really appreciate the opportunity to talk to you, and it was a great conversation. Thanks.
[00:43:29] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Alex Albu and Starburst
AI Workloads and Lakehouse Architecture
Challenges and Limitations of AI in Data Systems
Starburst's AI Integration and Features
Adoption and Impact of AI Capabilities
Engineering Challenges and Innovations
Future Plans and Developments in AI