Move Your Database To The Data And Speed Up Your Analytics With DuckDB

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

05 March 2022

Move Your Database To The Data And Speed Up Your Analytics With DuckDB - E270

Summary

When you think about selecting a database engine for your project, you typically consider options focused on serving multiple concurrent users. Sometimes what you really need is an embedded database that is blazing fast for single-user workloads. DuckDB is an in-process database engine optimized for OLAP applications that speeds up your analytical queries and meets you where you are, whether that’s Python, R, Java, or even the web. In this episode, Hannes Mühleisen, co-creator and CEO of DuckDB Labs, shares the motivations for creating the project, the myriad ways that it can be used to speed up your data projects, and the detailed engineering efforts that go into making it adaptable to any environment. This is a fascinating and humorous exploration of a truly useful piece of technology.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Your host is Tobias Macey and today I’m interviewing Hannes Mühleisen about DuckDB, an in-process embedded database engine for columnar analytics.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what DuckDB is and the story behind it?
Where did the name come from?
What are some of the use cases that DuckDB is designed to support?
The interface for DuckDB is similar (at least in spirit) to SQLite. What are the deciding factors for when to use one vs. the other?
- How might they be used in concert to take advantage of their relative strengths?
What are some of the ways that DuckDB can be used to better effect than options provided by different language ecosystems?
Can you describe how DuckDB is implemented?
- How have the design and goals of the project changed or evolved since you began working on it?
- What are some of the optimizations that you have had to make in order to support performant access to data that exceeds available memory?
Can you describe a typical workflow of incorporating DuckDB into an analytical project?
What are some of the libraries/tools/systems that DuckDB might replace in the scope of a project or team?
What are some of the overlooked/misunderstood/under-utilized features of DuckDB that you would like to highlight?
What is the governance model and plan for long-term sustainability of the project?
What are the most interesting, innovative, or unexpected ways that you have seen DuckDB used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on DuckDB?
When is DuckDB the wrong choice?
What do you have planned for the future of DuckDB?