In this episode Kostas Pardalis talks about Fenic, an open-source, PySpark-inspired dataframe engine designed to bring LLM-powered semantics into reliable data engineering workflows. Kostas shares why today's data infrastructure assumptions (BI-first, expert-operated, CPU-bound) fall short for AI-era tasks that are increasingly inference- and IO-bound. He explains how Fenic introduces semantic operators (e.g., semantic filter, extract, and join) as first-class citizens in the logical plan, so the optimizer can reason about inference calls, their costs, and their constraints. This lets developers turn unstructured data into explicit schemas, compose transformations lazily, and offload LLM work safely and efficiently. He also digs into Fenic's architecture (lazy dataframe API, logical and physical plans, Polars execution, a DuckDB/Arrow SQL path), how it exposes tools via MCP for agent integration, and where it fits into context engineering as a companion for memory and state management in agentic systems.
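To make the "semantic operators as first-class plan nodes" idea concrete, here is a minimal sketch of what such a pipeline might look like. This is an illustrative assumption based on the operators described above, not a verified copy of Fenic's public API: the `fenic` module name, the `Session`/`semantic.extract`/`semantic.predicate` call shapes, and the `Ticket` schema are all hypothetical stand-ins.

```python
# Hypothetical sketch only: module name, call signatures, and schema below are
# illustrative assumptions based on the episode description, not Fenic's
# documented API.
from pydantic import BaseModel
import fenic as fc  # assumed module name

class Ticket(BaseModel):
    """Explicit schema that the semantic extract targets."""
    product: str
    severity: str
    summary: str

session = fc.Session.get_or_create(fc.SessionConfig(app_name="support_triage"))  # assumed config API

df = session.create_dataframe({"body": ["...raw support email text..."]})

# Semantic operators compose lazily into a logical plan, so the engine can
# batch, budget, and reorder the underlying LLM calls before anything runs.
result = (
    df
    .select(fc.semantic.extract(fc.col("body"), Ticket).alias("ticket"))  # unstructured text -> typed rows
    .filter(fc.semantic.predicate("Is {{ticket}} about a billing issue?"))  # hypothetical semantic filter
    .collect()  # inference and execution happen only here
)
```

The design point the episode emphasizes is that because these steps are declared before execution, inference-bound work can be optimized the same way a SQL planner optimizes joins, rather than being hidden inside opaque UDFs.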
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build
- Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
- If you lead a data team, you know this pain: every department needs dashboards, reports, and custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks,' and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest: we all need to Retool how we handle data requests.
- Your host is Tobias Macey and today I'm interviewing Kostas Pardalis about Fenic, an opinionated, PySpark-inspired DataFrame framework for building AI and agentic applications
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Fenic is and the story behind it?
- What are the core problems that you are trying to address with Fenic?
- Dataframes have become a popular interface for doing chained transformations on structured data. What are the benefits of using that paradigm for LLM use-cases?
- Can you describe the architecture and implementation of Fenic?
- How have the design and scope of the project changed since you first started working on it?
- You position Fenic as a means of bringing reliability to LLM-powered transformations. What are some of the anti-patterns that teams should be aware of when getting started with Fenic?
- What are some of the most common first steps that teams take when integrating Fenic into their pipelines or applications?
- What are some of the ways that teams should be thinking about using Fenic and semantic operations for data pipelines and transformations?
- How does Fenic help with context engineering for agentic use cases?
- What are some examples of toolchains/workflows that could be replaced with Fenic?
- How does Fenic integrate with the broader ecosystem of data and AI frameworks? (e.g. Polars, Arrow, Qdrant, LangChain/Pydantic AI)
- What are the most interesting, innovative, or unexpected ways that you have seen Fenic used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Fenic?
- When is Fenic the wrong choice?
- What do you have planned for the future of Fenic?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- Fenic
- RudderStack
- Podcast Episode
- Trino
- Starburst
- Trino Project Tardigrade
- Typedef AI
- dbt
- PySpark
- UDF == User-Defined Function
- LOTUS
- Pandas
- Polars
- Relational Algebra
- Arrow
- DuckDB
- Markdown
- Pydantic AI
- AI Engineering Podcast Episode
- LangChain
- Ray
- Dask
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA