Summary
In this episode of the Data Engineering Podcast Alex Albu, tech lead for AI initiatives at Starburst, talks about integrating AI workloads with the lakehouse architecture. From his software engineering roots to leading data engineering efforts, Alex shares insights on enhancing Starburst's platform to support AI applications, including an AI agent for data exploration and using AI for metadata enrichment and workload optimization. He discusses the challenges of integrating AI with data systems, innovations like SQL functions for AI tasks and vector databases, and the limitations of traditional architectures in handling AI workloads. Alex also shares his vision for the future of Starburst, including support for new data formats and AI-driven data exploration tools.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what "done" looks like — so you can stop fighting over column names, and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda's launch week. It starts June 9th.
- This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.
- Your host is Tobias Macey and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining the interaction points of AI with the types of data workflows that you are supporting with Starburst?
- What are some of the limitations of warehouse and lakehouse systems when it comes to supporting AI systems?
- What are the points of friction for engineers who are trying to employ LLMs in the work of maintaining a lakehouse environment?
- Methods such as tool use (exemplified by MCP) are a means of bolting on AI models to systems like Trino. What are some of the ways that is insufficient or cumbersome?
- Can you describe the technical implementation of the AI-oriented features that you have incorporated into the Starburst platform?
- What are the foundational architectural modifications that you had to make to enable those capabilities?
- For the vector storage and indexing, what modifications did you have to make to Iceberg?
- What was your reasoning for not using a format like Lance?
- For teams who are using Starburst and your new AI features, what are some examples of the workflows that they can expect?
- What new capabilities are enabled by virtue of embedding AI features into the interface to the lakehouse?
- What are the most interesting, innovative, or unexpected ways that you have seen Starburst AI features used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI features for Starburst?
- When is Starburst/lakehouse the wrong choice for a given AI use case?
- What do you have planned for the future of AI on Starburst?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- Starburst
- AWS Athena
- MCP == Model Context Protocol
- LLM Tool Use
- Vector Embeddings
- RAG == Retrieval Augmented Generation
- Starburst Data Products
- Lance
- LanceDB
- Parquet
- ORC
- pgvector
- Starburst Icehouse
[00:00:11] Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Poor quality data keeps you from building best in class AI solutions. It costs you money and wastes precious engineering hours. There is a better way. Coresignal's multi-source, enriched, cleaned data will save you time and money. It covers millions of companies, employees, and job postings and can be accessed via API or as flat files. Over 700 companies work with Coresignal to develop AI solutions in investment, sales, recruitment, and other industries. Go to dataengineeringpodcast.com/coresignal and try Coresignal's self-service platform for free today.
This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of undiagnosed data quality syndrome, also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what "done" looks like, so you can stop fighting over column names and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you.
Side effects of implementing Soda may include increased trust in your metrics, reduced late night Slack emergencies, spontaneous high fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda's launch week, which starts on June 9. Your host is Tobias Macey, and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads. So, Alex, can you start by introducing yourself?
[00:02:11] Alex Albu:
My name is Alex Albu. I've been with Starburst for about six years now, and I'm currently the tech lead for our AI initiative.
[00:02:23] Tobias Macey:
And do you remember how you got started working in data?
[00:02:27] Alex Albu:
Yeah. I come from a software engineering background. But a few jobs ago, I was working for IDEXX on their veterinary practice software, and we had to build a few ETL pipelines pulling in data from various practices into IDEXX. I think the point where I really got into data engineering was when I was working on rebuilding an ETL system based on Hadoop. I replaced that with a Spark-based system, and the results were actually pretty spectacular. The performance gains let us go from running a five node cluster twenty-four seven to a smaller three node cluster that was just running a few hours a day. So that got me into big data.
When I moved on to my next job at a company called TraceLink, I built an analytics platform there, using Spark for ETL and Redshift for querying data. We started running into limitations of Redshift at that point and started looking at other solutions for analytics, and we came across Athena for querying data that we were dumping into a data lake. I thought this was a great solution until, again, we started using it for more serious use cases and I started running into limitations. And as I was researching ways to optimize my queries, well, at that point you couldn't even get a query plan from Athena.
But my research took me to Starburst, and that's basically how I ended up at Starburst. At Starburst, I've had kind of a nonlinear trajectory. I started as a software engineer, then I took on an engineering management job for about four years. And now I'm back as an IC working on the AI initiative.
[00:04:52] Tobias Macey:
And for people who want to dig deeper into Starburst and Trino and some of the history there, I'll add links in the show notes to some previous episodes I've done with other folks from the company. But for the purpose of today, given the topic of AI on the lakehouse and some of the different workloads, I'm wondering if you can start by giving a bit of an overview of some of the ways that the lakehouse architecture and platform intersect with the requirements and use cases around AI workloads.
[00:05:25] Alex Albu:
Yeah. So we started the AI initiative with two goals in mind. One was to make Starburst better for our users by using AI, and the other was to help our users build their own AI applications. So what does making Starburst better for users mean? That covers a wide range of things, but one of the things we've done is build an AI agent that allows users to explore their data using a conversational interface. The central point of that is data products, which are curated datasets where users can add descriptions and make them discoverable.
We wanted to take that a step further, so we've added a workflow that allows users to use AI to generate some of this metadata to enrich these data products. Users can then review the new metadata that was created, and they have the possibility to correct or add to what the machine did. What this gives users is not just better documentation to help them understand and discover their data; it also enables the agent that we created to answer questions and gain deeper insights into the data instead of just looking at schemas. We do other things with AI to make our users' lives easier. For example, we've had auto classification and tagging of data for a while, and we're using LLMs to mark, for example, PII columns.
We're also looking at using AI for things that are more behind the scenes, like workload optimizations: analyzing query plans and using that as input for our work on the optimizer, and producing recommendations for users to write their queries in a more efficient way. And then the other direction that we've been taking with AI is helping our users employ AI in their own applications. What we did for that was build a set of SQL functions that users can invoke from their queries and that give them access to models that they are able to configure in Starburst.
For listeners who are not very familiar with Starburst, I'll say that one of our tenets is optionality. We allow our users to bring their own backends to query, their own storage, their own access control systems. We provide flexibility at every step of the way, and AI models are no different. We allow users to configure whatever models they want to use, from a set of supported models, obviously. But the key is that they can use a service like Bedrock, or they can run an LLM on prem in an air gapped environment, and we support those scenarios. Essentially, what these functions allow you to do is implement the RAG workflow in a query. You can generate embeddings, you can do a vector search, and you can feed the results as context to an LLM call and get your result back. I think it's the easiest way to get started with LLMs. You don't need to know anything about APIs.
You don't need to know Python. Nothing.
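To make that concrete, here is a minimal sketch of what that RAG-in-a-query pattern can look like when embedding generation, vector search, and prompting are all exposed as SQL functions. The function names (ai_embed, ai_prompt), the documents table, and the cosine_similarity helper are illustrative assumptions rather than the confirmed Starburst API; the shape of the statement is the point.

    -- Hypothetical function names; actual Starburst signatures may differ.
    -- Embed the question, retrieve the closest documents, and pass them
    -- as context to an LLM, all in one statement.
    WITH question AS (
        SELECT ai_embed('embedding-model', 'How do I cancel my subscription?') AS qvec
    ),
    top_docs AS (
        SELECT d.body
        FROM documents d
        CROSS JOIN question q
        ORDER BY cosine_similarity(d.embedding, q.qvec) DESC
        LIMIT 5
    )
    SELECT ai_prompt(
               'llm-model',
               'Answer using only this context: ' || array_join(array_agg(body), chr(10))
           ) AS answer
    FROM top_docs;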
[00:09:59] Tobias Macey:
And another interesting aspect to the overall situation that we're in right now with LLMs and AI being the predominant area of focus for a large number of people is that there are a number of different contexts, particularly speaking as data practitioners, where we want to and need to interact with these models, where you need to be able to use them to help accelerate your own work as a data practitioner in terms of being able to generate code, generate schema diagrams, generate SQL, but you also need to provide datasets that can be consumed by these LLMs for maybe more business focused or end user focused applications.
And I'm wondering, particularly for some of those end user facing use cases, what you see as some of the existing limitations of warehouse and lakehouse architectures?
[00:11:07] Alex Albu:
I'd say it starts with the typical datasets that users will want to feed to LLMs. A lot of the work that users do with AI models is on unstructured data, like Excel spreadsheets or video and image data, and that typically doesn't fit well into a warehouse. There are other areas where things may not be ideal. For example, you were mentioning query generation. You need good quality metadata to be able to generate accurate queries, and a lot of times just reading the schema of a table or a set of tables is going to be insufficient.
So there are some limitations around the metadata that a typical warehouse or lakehouse will be able to expose. We say here at Starburst that your AI is gonna be only as good as your data is, but maybe it's even more true that your AI is gonna be only as good as your metadata is at the end of the day. And then there are other aspects. For example, consider training a model and providing it a training set. Here again, warehouses are not going to be ideal, in the sense that the data access patterns you need when you train a model, where you need to sample specific datasets, are typical for ML workloads, while warehouses are optimized for aggregating data and basically doing huge scans.
[00:13:05] Tobias Macey:
And then on the other side, for data practitioners who want to be able to use LLMs in the process of either processing data, iterating on table layouts, or generating useful SQL queries for data exploration, what are some of the current points of friction that you're seeing people run into?
[00:13:31] Alex Albu:
I think one classic one is around data privacy and regulatory compliance. That's going to be challenging, especially in multitenant lakehouses. I can tell you from experience that many of our customers have pretty strict rules about what data can be sent to an LLM, and they're even stricter than the rules on what specific users can access. So it's possible that a user is allowed to access, say, a column, but they don't want that column to be sent to an LLM. That's where I think a lot of these friction points are. LLMs can also struggle with large schemas when it comes to query generation, and large tables and complex lineage are also problems for them.
[00:14:31] Tobias Macey:
And then as far as the interaction of being able to feed things like the schemas, the table lineage, the query access patterns into an LLM, generally, that would be done either by doing an extract of that information and then passing it along or using something like an MCP server or some other form of tool use to be able to instruct the LLM how to retrieve that information for the cases where it needs it. And that's generally more of a bolt-on use case, whereas what it seems like you're doing right now with the Starburst platform is actually trying to bring the LLM into the context of the actual execution environment. And I'm wondering how you're seeing that change the ways that people think about using LLMs to interact with their lakehouse data, and some of the new capabilities or efficiencies that you're able to gain as a result.
[00:15:36] Alex Albu:
Yeah, I think that's a very pertinent observation. I'll say that the advent of MCP is great. It opens up a lot of data sources to LLMs. It's similar to how Trino has all these adapters for other data sources and opens up access to lots of data sources. But if you think about it, MCP defines a protocol for communicating with a data source, but it doesn't really say anything about the data that's going to be exposed, or the tools. Those are all left to the implementers' latitude.
And so I think the usefulness of MCP is going to depend on the quality of the servers that are going to be out there. I do think it's going to be a very useful tool. But, like any tool, it has to be used for the right use cases. So for example, you might have some data sitting in a Postgres database and some data sitting in an Iceberg table, and you might have MCP servers that provide you access to both of those. And you're going to ask your LLM, or your agent, to provide a summary of data that requires joining the two datasets.
I suppose it may be able to pull data from the two sources and join it, but that doesn't seem right. What we are proposing is to go the other direction. Using Starburst, you have access to both. You can federate the data, you can join it, and then you can gather the data and pass it through a SQL function to the LLM and have it summarize it or whatnot. So I do see what we're building as complementary to MCP, if you want. We're definitely also considering exposing an MCP server ourselves. But, again, use the right tool for the right job.
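As a sketch of that federation-first pattern: the join happens in the engine, and only the joined result is handed to the model. The catalog and table names and the ai_summarize task function are assumptions for illustration, not confirmed API.

    -- postgres and lake are two catalogs configured in the same engine;
    -- ai_summarize is a hypothetical task function taking a model name.
    SELECT ai_summarize(
               'llm-model',
               array_join(array_agg(o.status || ': ' || t.notes), chr(10))
           ) AS summary
    FROM postgres.crm.orders o
    JOIN lake.sales.tickets t          -- Iceberg table
      ON o.order_id = t.order_id
    WHERE o.created_at >= date '2025-01-01';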
[00:18:08] Tobias Macey:
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
And digging into the implementation of what you're building at Starburst to bring more of that AI oriented workflow into the context of the actual query engine, this federated data access layer, I'm wondering if you can give an overview of some of the features you're incorporating, the capabilities you're adding, and some of the foundational architectural changes that you've had to make, both to the Trino query engine and the Iceberg table format, to enable those capabilities?
[00:19:15] Alex Albu:
So there are a few things. I did mention SQL functions. We have roughly three sets of SQL functions that we provide. There are a few task-specific SQL functions that basically use predetermined prompts to perform things like summarization, classification, sentiment analysis, things like that. We also provide a more open ended prompt function that you can use to experiment with different prompts. We think they may not be opened up to the same groups of users. The prompt function may be used by data scientists or somebody who's more of a prompt engineer, while the task-specific ones don't really require much background in LLMs, so they can be used by a wider group of users. And then the third category is functions that allow you to use LLMs to generate embeddings for RAG use cases. One thing I've mentioned before is that we allow users to configure their own models.
And it's worth mentioning that you can configure multiple models. We offer quite a few knobs when you configure a model, not just temperature and top-p and other parameters; we also allow users to customize prompts per model, because one thing we've learned is that there's a pretty wide behavior gap between models. The other thing we offer for models is governance, and we offer governance at several levels. One is that you can obviously control who can access specific functions.
But then we take it a step further. These functions actually take as one of their arguments the specific model they are supposed to operate with, and so we offer the possibility for an admin to control access to specific models. It does make sense to restrict access to models that are very expensive. So that gives admins and data stewards quite a few levers in configuring that. Being able to provide governance at the model level has required a bit of a novel approach in the way we have tackled governance and access control in general.
Some other things we're building here are around model usage observability. All users are very interested in being able to set usage limits, budgets, even controlling the bandwidth that a specific user might use up in terms of, say, tokens per minute. And I did mention the conversational data interface that we've created. We think the differentiator there was building it around data products, which act as a semantic layer and allow users to provide insights into the data that are actually difficult to glean by an LLM or even a human. As far as architectural changes, I did allude a bit to governance and access control.
The other area where we've encountered technical challenges is in sending higher volumes of data to LLMs. LLMs are fairly slow to respond, so they don't fit very well with processing large datasets, and we're looking into ways to process large amounts of data cost efficiently. LLM providers do offer batch interfaces for such use cases. However, the challenge there is integrating that with a SQL interface. These batch APIs are typically async, so they're not going to work well with a SELECT statement. We are considering a slightly different paradigm there. Another thing we've built, or are currently in the process of building, is auditing.
That's another big component. Some of our users actually require capturing a fair amount of audit data for their LLMs. So that's again another challenge.
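A rough sketch of how those function families and the per-model argument could compose; the names here are placeholders rather than the shipped API, and access to the two models in the example would be gated by the governance layer described above.

    -- Task-specific functions with predetermined prompts:
    SELECT review_id,
           ai_classify('cheap-model', review_text,
                       ARRAY['billing', 'shipping', 'quality']) AS topic,
           ai_analyze_sentiment('cheap-model', review_text) AS sentiment
    FROM reviews;

    -- Open-ended prompt function, typically restricted to fewer users
    -- (e.g. prompt engineers) and to more expensive models:
    SELECT ai_prompt('expensive-model',
                     'List the product names mentioned in: ' || review_text)
    FROM reviews
    LIMIT 10;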
[00:24:56] Tobias Macey:
And then on the storage layer, a lot of people who are using Trino and Starburst are storing their data in Iceberg tables on object storage, and Iceberg as a format generally defers to Parquet or ORC as the storage layer. So there are a lot of pieces of coordination to be able to make any meaningful changes to the way they behave. I'm wondering, what are some of the ways that you have addressed the complexity of storing and accessing these vector embeddings in lakehouse contexts, and some of the reasons for sticking with Iceberg versus looking to other formats like Lance?
[00:25:42] Alex Albu:
So that's actually a very interesting question. It turns out that there are already discussions in the Iceberg community about supporting Lance as a file format, and we are looking into that. We're going to be working with the Iceberg community, but definitely using an alternate file format like Lance is on the table. It's an option that we are evaluating. I'll also say that for smaller datasets, it's possible to store data in a different data source like pgvector, for example. We do offer support for pgvector, so it's possible to use that as a vector database.
But there is a lot of interest in storing data in Iceberg. And the right data format, and the right shape of the indexes that are going to be required for efficiently doing vector searches and semantic searches, that's very much an area that's under active investigation and development.
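For reference, the pgvector option looks like this in plain Postgres terms; how Starburst surfaces it through its connector isn't spelled out in the conversation, so treat the integration wiring as an assumption.

    -- Plain pgvector: store embeddings and run a nearest-neighbor search.
    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE docs (
        id        bigint PRIMARY KEY,
        body      text,
        embedding vector(3)  -- real embedding models use hundreds to thousands of dimensions
    );
    CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);

    -- <=> is cosine distance; lower is closer.
    SELECT id, body
    FROM docs
    ORDER BY embedding <=> '[0.01, -0.2, 0.13]'::vector
    LIMIT 5;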
[00:27:02] Tobias Macey:
And so for teams who are adopting these new capabilities as you roll them out, what are some of the ways that you're seeing them either add new workloads to what they've been using Starburst for or some of the ways that it's changing the ways that they use Starburst for the work that they were already doing?
[00:27:23] Alex Albu:
Yeah. Like I mentioned before, the capabilities that we've added open up the possibility of essentially building and running an entire RAG workflow in SQL. You can generate embeddings for data you have, and we do have bulk APIs for doing that. You can perform a semantic search, and based on your top hits you can build your context and pass that to an LLM, without using a framework like LangChain, for example. I think this opens up new possibilities for analysts who otherwise would probably not have gotten close to these capabilities.
You can imagine, for example, dashboards built on queries that employ these SQL functions.
[00:28:29] Tobias Macey:
And then in terms of the education around these capabilities, everybody is at a different stage of their journey of overall adoption of generative AI and LLMs and bringing them into the work of building their data systems. I'm wondering how you're approaching the overall messaging around the capabilities, rolling out the features, and some of the validation that you're doing as you bring these capabilities to your broader audience and loop that feedback into your successive iterations of product development?
[00:29:08] Alex Albu:
Yeah. I think the way we ran this project was to get customers in the loop quite early by doing demos of, essentially, unreleased software. Our product team would demo early builds to get feedback and validate that we're on the right track and building useful stuff for our users. I do think that in this area, the best way to document and make customers aware of the value that we're providing is by showing them small applications that you can build using these capabilities.
So, for example, you could envision a database storing restaurant reviews. You could show how you can do a sentiment analysis on that and then render the daily reviews in a dashboard as red, yellow, and green bar charts. And what's cool about it is showing people how they can actually compose these functions. For example, if you had restaurant reviews in different languages, you could translate them all to English before doing sentiment analysis, or summarize the general complaints that people might have, things like that. So I think in general it's important to come up with some good examples that can highlight the new capabilities that this truly opens up.
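That composition reads naturally in SQL. A hedged sketch, reusing the hypothetical task functions from earlier (Trino's open source AI functions follow a similar pattern, though the exact Starburst signatures may differ):

    -- Translate each review to English, score sentiment, and group per day
    -- for a red/yellow/green dashboard.
    SELECT date(created_at) AS review_day,
           ai_analyze_sentiment('llm-model',
               ai_translate('llm-model', review_text, 'en')) AS sentiment,
           count(*) AS review_count
    FROM restaurant_reviews
    GROUP BY 1, 2
    ORDER BY 1;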
[00:31:07] Tobias Macey:
In the work that you did to bring these AI focused capabilities into Starburst and Trino and tack them onto the Iceberg capabilities, what are some of the interesting engineering challenges that you had to address, and what was some of the preexisting functionality that helped you on that path?
[00:31:30] Alex Albu:
Yeah. I think the main preexisting functionality is the capability to access a wide variety of data sources, because the name of the game here is getting your AI access to the data. In that sense, Starburst is uniquely positioned to be plugged into all of our users' data. And so making that available to LLMs is challenging and is an ongoing effort. I did mention how processing large amounts of data is challenging from a technical perspective because it doesn't fit well with the SQL paradigm.
But we do have some innovative ways in which we are going to allow that sort of processing to be embedded in, say, a workflow that our users might have. A few things that we've learned along the way are, essentially, that working with LLMs is a paradigm shift. We did learn that LLMs can be fickle, and writing tests is a real challenge. Being able to write tests is very challenging when the system that you're testing is probabilistic rather than deterministic.
So you need to get, to some extent, into the mindset of a data scientist and embrace experimentation. Everything here is very data driven, so generating meaningful datasets is critical for building such a product, and we're looking at various approaches for getting datasets that we can thoroughly test our models and our functionality on.
[00:33:56] Tobias Macey:
And as you have been releasing these capabilities, onboarding some of your early adopters, what are some of the most interesting or innovative or unexpected ways that you've seen these AI focused capabilities applied?
[00:34:10] Alex Albu:
We're obviously at the beginning of this journey, like the whole industry. But we do have a few customers who are more advanced, and we've seen them do some interesting things. For example, when exploring different datasets coming from, say, different providers they might be ingesting data from, they'll need to join these datasets, but they won't necessarily have similarly named columns. Inferring, say, the join column is not always easy based on just a column name match. But with AI, you can actually do some semantic analysis and find the matching columns that way, and essentially have the machine figure out the join in ways that were not possible before. Another interesting thing we've seen our customers do, and this is actually something that we're also looking into for internal use, is generating synthetic datasets, which removes the danger of using PII data in testing and things like that.
So using LLMs to generate synthetic datasets is another interesting use case that we've seen.
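The semantic join-key matching can be sketched with the same hypothetical prompt function, handing the model only the schemas rather than the data:

    -- Ask the model to propose a join key from two column lists
    -- (ai_prompt is a hypothetical open-ended prompt function).
    SELECT ai_prompt(
        'llm-model',
        'Table A has columns (cust_ref, total, ts). ' ||
        'Table B has columns (customer_id, order_value, created_at). ' ||
        'Which pair of columns most likely joins the tables? Answer as A.col = B.col.'
    ) AS suggested_join;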
[00:35:38] Tobias Macey:
For people who are interested in being able to use AI in the context of their data systems, what are the situations where either Starburst specifically or the lakehouse architecture generally are the wrong choice?
[00:35:57] Alex Albu:
So I think at this point, I wouldn't recommend using Starburst for a use case that uses data like video or large blobs of unstructured data, or sensor data, or something like that. We're not ready at this point to deal with those types of data. Also, high volume, high concurrency operations are not going to fit well here, and a lot of that is actually due to the performance of LLMs and the lack of support for such operations.
But, again, as with everything, choose the right tool for the right task. While I think Starburst can handle a lot of use cases, it's definitely not going to handle all of them.
[00:36:54] Tobias Macey:
And as you continue building out these AI focused capabilities, the landscape around you continues to evolve at a blistering pace. What are some of the things you have planned for the near to medium term, or any particular projects or problem areas you're excited to explore?
[00:37:13] Alex Albu:
Yeah, you're right, this is moving at a fast pace. We do have a lot of plans. We plan to add MCP support, and we're thinking about how to make it the most useful, working with our customers. You'll notice that a lot of databases out there just expose a simple API to run a query in their native query language. I think we can do more than that and allow agents to use an MCP server to automate a lot of the tasks users might want to do in, say, Galaxy, like spinning up clusters or setting up various resources. I think there are lots of opportunities there for integrating with agents that our users have already built, actually. You were asking about the work for storing vectors in Iceberg.
That's definitely an area that we're building in, essentially making Starburst a more performant vector database. That includes looking at new file types for storing the data and looking at various vector indexes, as well as indexes that support full text search. We were also talking about some of the weaknesses that I've mentioned that are common for warehouses, and Starburst is no exception. We are going to be working in that area to provide better support for data types that are currently not as well supported, like PDF files, Excel spreadsheets, and other more unstructured data, potentially images and video.
So we're looking at extending Icehouse, our managed platform for data ingestion and transformation, to be able to ingest various types of data, generate embeddings for it, and potentially apply AI transformations. Again, there are lots of possibilities that we see there. And we do want to continue extending the use of AI features throughout the product for the things I've mentioned, which would allow us to make the product more efficient and provide recommendations to our users for improving their queries and the way they use the data. And then finally, we're working on extending the agent that we built to be able to generate graphical visualizations and data explorations. I personally think this is the way BI tools are headed: allowing users to do ad hoc explorations just using natural language and visualizing the results of their questions.
[00:40:50] Tobias Macey:
Are there any other aspects of the work that you're doing, the AI focused capabilities that you're adding into Starburst or just the overall space of building for and with AI as data practitioners that we didn't discuss yet that you would like to cover before we close out the show?
[00:41:08] Alex Albu:
No, I think we're good. We covered a fair amount of topics here. It's definitely a very exciting area that is going to be a game changer for the way we interact with data and gain insights from it.
[00:41:33] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and the rest of the Starburst team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:41:53] Alex Albu:
Oh, tough question. I think if I were to choose something, maybe it's a lack of intelligent data observability and contextual understanding of your data. Current tools are good at, say, syntax validation and basic data profiling, but they still struggle with semantic understanding of the relationships between data. And, incidentally, I think this is an area where AI is going to be able to help and provide insights that were not achievable before.
[00:42:45] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you and the rest of the Starburst folks are doing on bringing AI closer into the lakehouse and the ways that we can use these models to accelerate our own work as data practitioners, working with these large and complex data estates that we're responsible for. I appreciate all the time and energy that you're putting into that, and I hope you enjoy the rest of your day.
[00:43:14] Alex Albu:
Thank you. I really appreciate the opportunity to talk to you, and it was a great conversation. Thanks.
[00:43:29] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Poor quality data keeps you from building best in class AI solutions. It costs you money and wastes precious engineering hours. There is a better way. Core signal's multi source enriched cleaned data will save you time and money. It covers millions of companies, employees, and job postings and can be accessed via API or as flat files. Over 700 companies work with Core Signal to develop AI solutions in investment, sales, recruitment, and other industries. Go to dataengineeringpodcast.com/coresignal and try Core Signal's self-service platform for free today.
This is a pharmaceutical ad for SOTA data quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of undiagnosed data quality syndrome, also known as UDQS. Ask your data team about SOTA. With SOTA metrics observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks and the fastest in the category, analyzing 1,100,000,000 rows in just sixty four seconds. And with collaborative data contracts, engineers and business can finally agree on what done looks like, so you can stop fighting over column names and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you.
Side effects of implementing soda may include increased trust in your metrics, reduced late night Slack emergencies, spontaneous high fives across departments, fewer meetings, and less back and with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a 1,000 plus dollar custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow SOTA's launch week, which starts on June 9. Your host is Tobias Macy, and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads. So, So, Alex, can you start by introducing yourself?
[00:02:11] Alex Albu:
My name is Alex Albu. I I've been, with Starburst, for about, six years now, and I currently am, the tech lead for our, AI initiative.
[00:02:23] Tobias Macey:
And do you remember how you got started working in data?
[00:02:27] Alex Albu:
Yeah. I I come from a software engineering background. But, a few jobs ago, I was working for, IDEX, working on their veterinary, practice, software. And, we we had to build a few ETL pipelines, pulling in data from various practices into into IDEX. And I I think maybe the the point where I really got into data engineering was, when I, was working on, I on rebuilding an ETL system based on Hadoop. And I I I replaced that with a spark based system. And the the results were actually pretty spectacular. The I think, performance gains were let us go from, running a five node cluster, like, twenty four seven to, like, a smaller three node cluster that was just running a a few hours a day. So so that got me into into big data.
And, when I moved on to my next job at, at a company called TraceLink, I, I built there an, an analytics platform, using, using Spark for detail and Redshift for querying data. And we started, running into limitations of Redshift at that point. Started looking at, other solutions for analytics, and we came across Athena for, for querying data that, we are dumping into a data lake. I I thought this was a great solution, until, again, we we started using it for more serious, use cases, and, I started running implementations. And as I was, researching, you know, ways to optimize my queries to at at that point, you couldn't even get a query plan from, from Athena.
But, my research took me to, to Starburst. And and that's that's basically how I ended up at, Starburst. And at Starburst, I've been, you know, I had kind of a nonlinear trajectory. I started as a software engineer, Then I I took on, an engineering management job for a few years, for about four years. And now I'm back, as an IC working on, on the AI initiative.
[00:04:52] Tobias Macey:
And for people who want to dig deeper into Starburst, Intrino, and some of the history there, I'll add links in the show notes to some previous episodes I've done with other folks from the company. But for the purpose of today, given the topic of AI on the lake house and some of the different workloads, I'm wondering if you can just start by giving a bit of an overview of some of the ways that the lake house architecture, the Lakehouse platform intersects with some of the requirements and some of the use cases around AI workloads.
[00:05:25] Alex Albu:
Yeah. So as part of the AI initiative, you know, we like, we started this with two goals in mind. One was to make Starburst better for for our users by using AI, and the other one was to help our users build their own AI applications. And so what what does making Starburst better for users? You know, that covers a a wide range of of things. But, one of the things, we've done was, build an AI agent that allows users to explore their data using a conversational interface. And the sort of the the central point of that is, is is data products, which are curated datasets where users can can add, descriptions and, make them discoverable.
And and so we we wanted to take that a step further, and we've added a workflow that, allows users to to use AI to, generate some of this metadata to enrich these data products. And then users can can review these these, changes, these, this new metadata that was created. And, and they they have they have the possibility to, to correct or or add to, what the machine did. So then what this gives gives users is not just better better better documentation and, you know, help have some have some understand and discover their data, but it also enables an agent the agent that we created to to be able to answer questions and and gain, deeper insights, into the data instead of just just letting it look at at schemas. Right? We do we do other things with, with AI to make, make our users' lives easier. Like, for example, we we've we've had for a while, auto classification and tagging of of data. Right? We're using LLMs to mark, for example, PII, columns.
And, we're we're also looking at using AI for for things that are sort of more behind the scenes, like workload optimizations and and, you know, like, analyzing, query plans and decide you know, using that as as input for our work on, on the optimizer, producing, recommendations for, for users, for to make their to write their queries in a more efficient way. So and then the other direction that that we've been taking with AI was helping our users, employ AI in their own applications. And so what we did for that was, we built a set of SQL functions, right, that that users can invoke from their queries and that give them access to to models that they are, able to configure in Starburst.
So as with, you know, for for users who for for listeners who are not, very familiar with Starburst, I'll say that one of our tenets is is, optionality. So we allow our, users to bring their own back ends to query, write their own storage, their own, access control systems. So we we pro we provide flexibility at every step of the way. And AI models are no different. We allow users to configure, whatever elements they they want to use. Well, you know, from a from a from a set of supported supported models, obviously. But the the key is that they can they can use, you know, a service like, Bedrock, or they can run an LLM on prem in an air gapped environment. And we we support, those, those, scenarios. And and, essentially, what these functions allow you to do is express basically implement the rack workflow in a in a query. Like like, truly, like, you can you can generate embeddings, you can do a vector search, and you can feed the results as, as context to an LLM call and get your result back. I think it's the easiest way to to basically get started with with Nela. You don't need to know anything about APIs.
Don't need to know Python. Nothing.
[00:09:59] Tobias Macey:
And another interesting aspect to the overall situation that we're in right now with LLMs and AI being the predominant area of focus for a large number of people is that there are a number of different contexts, particularly speaking as data practitioners, where we want to and need to interact with these models, where you need to be able to use them to help accelerate your own work as a data practitioner in terms of being able to generate code, generate schema diagrams, generate SQL, but you also need to provide datasets that can be consumed by these LLMs for maybe more business focused or end user focused applications.
And I'm wondering, particularly for some of those end user facing use cases, what you see as some of the existing limitations of warehouse and lakehouse architectures
[00:11:07] Alex Albu:
typical datasets that, that LLMs that users will want to feed to LLMs. So so for example, you know, a lot of work that users do with AI models is is on unstructured data. You know, like, it's cell spreadsheets or, video image data, stuff like that. And that's that doesn't fit well in into a into a warehouse typically. Right? Other other there are other other areas where, things may not be ideal. So for example, you you were mentioning, you know, like, query generation, things like that. You you need good quality metadata in order to, to be able to generate, accurate queries. Right? And a lot of times, just just the schema reading the schema of, of of a table or of a set of tables is is going to be insufficient.
So there are some some limitations around the metadata that, typical warehouse or lakehouse will, will be able to expose. You know, we we say, here at Starburst that your AI is gonna be only as good as your data is, but maybe it's even more true that your AI is gonna be only as good as your metadata is at the end of the day. And then there are there are other aspects. So so for example, if you consider, training a model, providing it a training set. Right? So I I I think that here again, like, maybe warehouses are not going to be ideal in the sense that the the data access patterns that that you need when you train a model, like, where where you need to sample data, you know, specific datasets, that that that would be, an an access pattern that that would be typical for an NLM, while warehouses are optimized typically, right, for for aggregating data and, basically doing huge scans.
[00:13:05] Tobias Macey:
And then on the other side, for data practitioners who want to be able to use LLMs in the process of either processing data or iterating on table layouts or being able to generate useful SQL queries for doing data exploration? What are some of the current points of friction that you're seeing people run into?
[00:13:31] Alex Albu:
I think one classic one is, I think, around, data privacy and regulatory compliance. That's that's going to be challenging, especially in multitenant lake houses. There are there are I can tell you from experience that, many of our customers are have have pretty strict rules about what data can be sent to an LNM, and they're even stricter than, essentially the the rules on what, specific users can access. Right? So, like, it's possible that that the user is allowed to access, say, a column, but they don't want that to be sent to to an LLM. So that's, I that's where I think, a lot of the these these friction points are. Adelands can also struggle with with large schemas when when it comes to to query generation. And also, if large tables, the complex lineage, those are all problems for them.
[00:14:31] Tobias Macey:
And then as far as the interaction of being able to feed things like the schemas, the table lineage, the query access patterns into an LLM, generally, that would be done either by doing an extract of that information and then passing it along or using something like an MCP server or some other form of tool use to be able to instruct the LLM how to retrieve that information for the cases where it needs it. And that's generally more of a bolt on use case, whereas what it seems like you're doing right now with the Starburst platform is actually trying to bring the LLM into the context of the actual execution environment. And I'm wondering what are some of the ways that you're seeing that change the ways that people think about using LLMs to interact with their lakehouse data and some of the new capabilities or some of the efficiencies that you're able to gain as a result.
[00:15:36] Alex Albu:
Yeah. I think that's that's a that's a very pertinent con observation. So, I mean, I'll I'll say that, you know, the advent of MCP is is, is great. I think, you know, like, it it opens up a lot of data to, data sources to LLMs. You know, it's similar to how, Trino, you know, opens up like, has has all these adapters for other data sources, right, and and opens opens up access to access to to lots of data sources. But if you if you think about it, MCP is is, is defines a protocol for communicating with with a data source, but it doesn't really say anything about the, the data that's going to be exposed, right, or or the tools. Like, those are all left at the implementers' latitude.
And so I think, you know, the usefulness of MCP is going to depend on the quality of the the servers that are going to be out there. I I do I do I do think it's, it's it's going to be a very useful tool. But, like, with any tool, you have to you have to it has to be used for the right case, use cases. Right? So for example, you might have some data sitting in a Postgres database and some data sitting in an iceberg table. And you might have MCP servers that that provide you access, to both of those. And, you know, you're you're going to ask your LLM to or your agent to provide a summary of data, you know, like, that that requires joining the two datasets. Right?
I suppose it may be able to pull data from the two sources and join it, but that doesn't seem right. What we are proposing is to go the other direction. Using Starburst, you have access to both. You can federate the data, you can join it, and then you can gather the result and pass it through a SQL function to the LLM and have it summarize it or whatnot. So I do see what we're building as complementary to MCP, if you want. We're definitely also considering exposing an MCP server ourselves. But, again, use the right tool for the right job.
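To make the pattern Alex describes concrete, here is a minimal sketch of federating a Postgres catalog with an Iceberg catalog and passing the joined result through an LLM-backed SQL function. The catalog, schema, and function names (including ai_summarize) are illustrative assumptions, not confirmed Starburst APIs:

```sql
-- Join operational data in Postgres with lakehouse data in Iceberg,
-- then hand the combined text to a (hypothetical) summarization function.
SELECT ai_summarize(feedback_digest) AS summary   -- ai_summarize is assumed
FROM (
    SELECT array_join(array_agg(c.segment || ': ' || f.feedback_text), chr(10)) AS feedback_digest
    FROM postgresql.crm.customers AS c
    JOIN iceberg.events.feedback AS f
      ON c.customer_id = f.customer_id
);
```

The point is that the join happens inside the engine, so the LLM only ever sees the already-federated result rather than pulling from two MCP servers and joining client-side.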
[00:18:08] Tobias Macey:
And digging into the implementation of what you're building at Starburst to bring more of that AI oriented workflow into the context of the actual query engine, this federated data access layer, I'm wondering if you can give an overview of some of the features that you're incorporating, the capabilities that you're adding, and some of the foundational architectural changes that you've had to make, both to the Trino query engine and the Iceberg table format, to enable those capabilities?
[00:19:15] Alex Albu:
So there are a few things. I did mention SQL functions. We have, broadly, three sets of SQL functions that we provide. There are a few task-specific SQL functions that use predetermined prompts to perform things like summarization, classification, sentiment analysis, things like that. We also provide a more open-ended prompt function that you can use to experiment with different prompts. We think they may not be opened up to the same groups of users: the prompt function may be used by, say, data scientists or somebody who's more of a prompt engineer, while the task-specific ones don't really require much background in LLMs, so they can be used by a wider group of users. And then the third category is functions that allow you to use LLMs to generate embeddings for RAG use cases. One thing I think I've mentioned before is that we allow users to configure their own models.
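As a rough illustration of the three function families Alex outlines, here is a sketch with assumed names and signatures; the real Starburst functions may be named and shaped differently:

```sql
-- Task-specific: a predetermined prompt under the hood, no LLM background needed.
SELECT ticket_id,
       ai_classify(body, ARRAY['billing', 'outage', 'feature request']) AS category
FROM iceberg.support.tickets;

-- Open-ended: a free-form prompt, aimed at prompt-engineer-type users.
SELECT ai_prompt('Rewrite for a non-technical audience: ' || release_notes) AS blurb
FROM iceberg.eng.releases;

-- Embeddings: vector output for RAG use cases.
SELECT chunk_id, ai_embed(chunk_text) AS embedding
FROM iceberg.docs.chunks;
```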
And it's worth mentioning that you can configure multiple models. We offer quite a few knobs when you configure a model, not just temperature and top-p and other parameters; we also allow users to customize prompts per model, because one thing we've learned is that there's a pretty wide behavior gap between them. The other thing that we offer for models is governance, and we offer it at several levels. One is that you can obviously control who can access specific functions.
But then we take it a step further. These functions actually take as one of their arguments the specific model that they are supposed to operate with, and so we offer the possibility for an admin to control access to specific models. It does make sense to restrict access to models that are very expensive, for example. So that gives admins and data stewards quite a few levers in configuring that. Being able to provide governance at the model level has required a bit of a novel approach in the way we have tackled governance and access control in general.
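The episode doesn't spell out the governance syntax, but the shape might look something like the following sketch, where both the GRANT-on-model statements and the model argument are assumptions rather than documented features:

```sql
-- Hypothetical: restrict an expensive model to a narrow role,
-- while a cheaper model stays broadly available.
GRANT EXECUTE ON MODEL premium_model TO ROLE prompt_engineers;
GRANT EXECUTE ON MODEL economy_model TO ROLE analysts;

-- Because the model is a function argument, the engine can enforce
-- the grant on every call.
SELECT ai_summarize(report_text, 'economy_model')
FROM iceberg.finance.reports;
```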
Some other things that we're building here are around model usage observability. All users are very interested in being able to set usage limits, budgets, even controlling the bandwidth that a specific user might use up in terms of, say, tokens per minute. And I did mention the conversational data interface that we've created. We do think the differentiator there was building it around data products, which act as a semantic layer and allow users to provide insights into the data that are actually difficult for an LLM, or even a human, to glean. As far as architectural changes, I did allude a bit to governance and access control.
The other area where we encountered technical challenges was in sending higher volumes of data to LLMs. LLMs are fairly slow to respond, so they don't fit very well with processing large datasets, and we're looking into ways to process large amounts of data cost-efficiently. LLM providers do offer batch interfaces for such use cases; however, the challenge there is integrating that with a SQL interface. These batch APIs are typically async, so they're not going to work well with a select statement. We are considering a slightly different paradigm there. Another thing that we've built, or are currently in the process of building, is auditing.
That's another big component. Some of our users actually require capturing a fair amount of audit data for their LLM usage. So that's again another challenge.
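One way the async batch paradigm could surface in SQL, purely as a sketch, is a submit-and-poll flow rather than a blocking SELECT. The ai.submit_batch procedure here is invented for illustration; the episode only says that batch APIs don't fit a select statement and that a different paradigm is under consideration:

```sql
-- Hypothetical procedure: submit a job to the LLM provider's batch API,
-- with results landing in a table instead of a synchronous result set.
CALL ai.submit_batch(
    input_sql    => 'SELECT review_id, review_text FROM iceberg.reviews.raw',
    prompt       => 'Summarize this review in one sentence',
    model        => 'economy_model',
    output_table => 'iceberg.reviews.summaries'
);

-- Poll later, once the async batch has completed.
SELECT review_id, summary
FROM iceberg.reviews.summaries;
```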
[00:24:56] Tobias Macey:
And then on the storage layer, a lot of people who are using Trino and Starburst are storing their data in Iceberg tables on object storage, and Iceberg as a format generally defers to Parquet or ORC as the storage layer. So there are a lot of pieces of coordination to be able to make any meaningful changes to the way that they behave. And I'm wondering what are some of the ways that you have addressed the complexity of storing and accessing these vector embeddings in lakehouse contexts, and some of the reasons for sticking with Iceberg for that versus looking to other formats like Lance?
[00:25:42] Alex Albu:
So that's actually a very interesting question. It turns out that there are already discussions in the Iceberg community about supporting Lance as a file format, and we are looking into that. We're going to be working with the Iceberg community, but definitely using an alternate file format like Lance is on the table; it's an option that we're evaluating. I'll also say that for smaller datasets, it's possible to store data in a different data source like pgvector, for example. We do offer support for pgvector, so it's possible to use that as a vector database.
But there is a lot of interest in storing data in Iceberg. And the right data format, and the right shape of the indexes that are going to be required for efficiently doing vector searches and semantic searches, that's very much an area that's under active investigation and development.
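For the smaller-dataset path, pgvector works today in plain Postgres. A minimal example, with illustrative table and values (real embeddings have hundreds of dimensions):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_chunks (
    id        bigint PRIMARY KEY,
    chunk     text,
    embedding vector(3)   -- toy dimensionality for the example
);

INSERT INTO doc_chunks VALUES (1, 'rotate credentials monthly', '[0.1, 0.9, 0.2]');

-- <=> is pgvector's cosine-distance operator; smallest distance ranks first.
SELECT id, chunk
FROM doc_chunks
ORDER BY embedding <=> '[0.1, 0.8, 0.3]'
LIMIT 5;
```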
[00:27:02] Tobias Macey:
And so for teams who are adopting these new capabilities as you roll them out, what are some of the ways that you're seeing them either add new workloads to what they've been using Starburst for, or change how they use Starburst for the work that they were already doing?
[00:27:23] Alex Albu:
Yeah. Like I mentioned before, the capabilities that we've added open up the possibility of essentially building and running an entire RAG workflow in SQL. You can generate embeddings for data you have, and we do have bulk APIs for doing that. You can perform a semantic search, and based on your top hits, you can build your context and pass that to an LLM, without using a framework like LangChain, for example. I think this opens up new possibilities for analysts who otherwise would probably not have gotten close to these capabilities.
You can imagine dashboards built on top of queries that employ these SQL functions.
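A sketch of that RAG-in-SQL flow, with ai_embed, ai_prompt, and cosine_similarity all assumed rather than documented: embed the question, rank the stored chunks, then pass the top hits as context, all in a single statement:

```sql
WITH hits AS (
    SELECT chunk_text
    FROM iceberg.docs.chunks
    ORDER BY cosine_similarity(embedding, ai_embed('How do I rotate credentials?')) DESC
    LIMIT 5
)
SELECT ai_prompt(
           'Answer using only this context:' || chr(10)
           || array_join(array_agg(chunk_text), chr(10))
           || chr(10) || 'Question: How do I rotate credentials?'
       ) AS answer
FROM hits;
```

No orchestration framework is involved; the context assembly is just string aggregation inside the engine.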
[00:28:29] Tobias Macey:
And then in terms of the education around these capabilities, everybody is at a different stage of their journey of overall adoption of generative AI and LLMs, and of bringing them into the work of building their data systems. And I'm wondering how you're approaching the overall messaging around the capabilities, rolling out the features, and some of the validation that you're doing as you bring these capabilities to your broader audience and loop that feedback into your successive iterations of product development?
[00:29:08] Alex Albu:
Yeah. I think the way we ran this project was to get customers in the loop quite early by doing demos of, essentially, unreleased software. Our product team would demo early builds to get feedback and validate that we were on the right track and building useful stuff for our users. I do think that in this area, the best way to document and make customers aware of the value that we're providing is by showing them small applications that you can build using these capabilities.
So you could envision, say, a database storing restaurant reviews. You could show how you can do sentiment analysis on that and then render the daily reviews in a dashboard as red, yellow, and green bar charts. And what's cool about it is showing people how they can actually compose these functions. For example, if you had restaurant reviews in different languages, you could translate them all to English before doing sentiment analysis, or summarize the general complaints that people might have, things like that. So I think, in general, it's important to come up with some good examples that can highlight the new capabilities this truly opens up.
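The restaurant-review demo compresses nicely into composed function calls. Assuming hypothetical ai_translate and ai_analyze_sentiment functions and an illustrative table, the translate-then-analyze composition might look like:

```sql
-- Translation runs first so sentiment analysis always sees English text;
-- the grouped counts can feed a red/yellow/green dashboard.
SELECT date_trunc('day', created_at) AS review_day,
       ai_analyze_sentiment(ai_translate(review_text, 'en')) AS sentiment,
       count(*) AS review_count
FROM iceberg.reviews.restaurant_reviews
GROUP BY 1, 2;
```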
[00:31:07] Tobias Macey:
In the work that you did to bring these AI focused capabilities into Starburst and Trino, and tack them onto the Iceberg capabilities, what are some of the interesting engineering challenges that you had to address, and what was some of the preexisting functionality that helped you along that path?
[00:31:30] Alex Albu:
Yeah. I think the main preexisting functionality is the capability to access a wide variety of data sources, because the name of the game here is getting your AI access to the data. In that sense, Starburst is uniquely positioned to plug into all of our users' data. And so making that available to LLMs is challenging and is an ongoing effort. I did mention how processing large amounts of data is challenging from a technical perspective because it doesn't fit well with the SQL paradigm.
But we do have some innovative ways in which we are going to allow that sort of processing to be embedded in, say, a workflow that our users might have. A few things that we've learned along the way are that, essentially, working with LLMs is a paradigm shift. We did learn that LLMs can be fickle, and writing tests is a real challenge. Being able to write tests is very hard when the system you're testing is probabilistic rather than deterministic.
So you need to get, to some extent, into the mindset of a data scientist and embrace experimentation. Everything here is very data driven, so generating meaningful datasets is critical for building such a product, and we're looking at various approaches for getting datasets that we can thoroughly test our models, and our functionality, on.
[00:33:56] Tobias Macey:
And as you have been releasing these capabilities, onboarding some of your early adopters, what are some of the most interesting or innovative or unexpected ways that you've seen these AI focused capabilities applied?
[00:34:10] Alex Albu:
We're obviously at the beginning of this journey, like the whole industry. But we do have a few customers who are more advanced, and we've seen them do some interesting things. For example, when exploring different datasets coming from, say, different providers they might be ingesting data from, they'll need to join these datasets, but the datasets won't necessarily have similarly named columns. Inferring, say, the join column is not always easy based on just a column name match. But with AI you can actually do some semantic analysis, find the matching columns that way, and essentially have the machine figure out the join in ways that were not possible before. Another interesting thing that we've seen our customers do, and this is actually something that we're also looking into for internal use, is generating synthetic datasets, which removes the danger of using PII data in testing and things like that.
So using LLMs to generate synthetic datasets is another interesting use case that we've seen.
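A rough sketch of the embedding-based join inference described above: embed the column names (or sampled values) from each provider's table and rank the cross-pairs by similarity. Here ai_embed and cosine_similarity are assumed functions, and the table names are invented:

```sql
WITH a AS (
    SELECT column_name, ai_embed(column_name) AS v
    FROM iceberg.information_schema.columns
    WHERE table_name = 'provider_a_events'
),
b AS (
    SELECT column_name, ai_embed(column_name) AS v
    FROM iceberg.information_schema.columns
    WHERE table_name = 'provider_b_events'
)
SELECT a.column_name AS left_column,
       b.column_name AS right_column,
       cosine_similarity(a.v, b.v) AS score
FROM a CROSS JOIN b
ORDER BY score DESC
LIMIT 10;
```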
[00:35:38] Tobias Macey:
For people who are interested in being able to use AI in the context of their data systems, what are the situations where either Starburst specifically or the lakehouse architecture generally are the wrong choice?
[00:35:57] Alex Albu:
So I think at this point, I wouldn't recommend using Starburst for a use case that involves data like video, large blobs of unstructured data, or sensor data, or something like that. We're not ready at this point to deal with those types of data. Also, high volume, high concurrency operations are not going to fit well here. A lot of that is actually due to the performance of LLMs and the lack of support for such operations.
But, again, as with everything, choose the right tool for the right task. While I think Starburst can handle a lot of use cases, it's definitely not going to handle all of them.
[00:36:54] Tobias Macey:
And as you continue building out these AI focused capabilities, while the landscape around you continues to evolve at a blistering pace, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas you're excited to explore?
[00:37:13] Alex Albu:
Yeah, you're right, this is moving at a fast pace. We do have a lot of plans. We do plan to add MCP support, and we're thinking about how to make it the most useful, working with our customers. You'll notice that a lot of databases out there just expose a simple API to run a query in their native query language. I think we can do better. We can do more than that and allow agents to use an MCP server to automate a lot of the tasks users might want to do in, say, Galaxy, like spinning up clusters or setting up various resources. I think there are lots of opportunities there for integrating with agents that our users have already built, actually. And you were asking about the work for storing vectors in Iceberg.
That's definitely an area that we're building in, essentially making Starburst a more performant vector database. That includes looking at new file types for storing the data, and looking at vector indexes as well as indexes that support full text search. We were also talking about some of the weaknesses that I've mentioned that are common for warehouses, and Starburst is no exception. We are going to be working in that area to provide better support for data types that are currently maybe not as well supported, like PDF files, Excel spreadsheets, more unstructured data, potentially images and video.
So we're looking at extending Icehouse, our managed platform for data ingestion and transformation, to be able to ingest various types of data, generate embeddings for it, and potentially apply AI transformations. Again, there are lots of possibilities that we see there. And we do want to continue extending the use of AI features throughout the product, for things that I've mentioned, which would allow us to make the product more efficient and provide recommendations to our users for improving their queries and the way they use the data. And then finally, we're working on extending the agent that we built to be able to generate graphical visualizations and data explorations. I personally think this is the way BI tools are headed: allowing users to do ad hoc explorations just using natural language, and visualizing the results of their questions.
[00:40:50] Tobias Macey:
Are there any other aspects of the work that you're doing, the AI focused capabilities that you're adding into Starburst or just the overall space of building for and with AI as data practitioners that we didn't discuss yet that you would like to cover before we close out the show?
[00:41:08] Alex Albu:
No, I think we covered a fair amount of topics here. It's definitely a very exciting area that is going to be a game changer for the way we interact with data and gain insights from it.
[00:41:33] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and the rest of the Starburst team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:41:53] Alex Albu:
Oh, tough question. I think if I were to choose something, maybe it's a lack of intelligent data observability and contextual understanding of your data. Current tools are good at, say, syntax validation and basic data profiling, but they still struggle with semantic understanding of the relationships between data. And, incidentally, I think this is an area where AI is going to be able to help and provide insights that were not achievable before.
[00:42:45] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you and the rest of the Starburst folks are doing on bringing AI closer into the process of building data systems, and the ways that we can use these capabilities to accelerate our own work as data practitioners working with the large and complex data estates that we're responsible for. I appreciate all the time and energy that you're putting into that, and I hope you enjoy the rest of your day.
[00:43:14] Alex Albu:
Thank you. I really appreciate the opportunity to talk to you, and it was a great conversation. Thanks.
[00:43:29] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Alex Albu and Starburst
AI Workloads and Lakehouse Architecture
Challenges and Limitations of AI in Data Systems
Starburst's AI Integration and Features
Adoption and Impact of AI Capabilities
Engineering Challenges and Innovations
Future Plans and Developments in AI