Data lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data, there has been a struggle to merge fast, incremental updates with large-scale historical analysis. Vinoth Chandar helped to create the Hudi project while at Uber to address this challenge. By adding support for small, incremental inserts into large table structures and building support for arbitrary update and delete operations, the Hudi project brings the best of both worlds together. In this episode, Vinoth shares the history of the project, how its architecture allows for building more frequently updated analytical queries, and the work being done to add a more polished experience to the data lake paradigm.
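To make the idea of small incremental writes plus arbitrary updates and deletes more concrete, here is a minimal toy sketch in plain Python. It is not Hudi's API or storage format: the `ToyTable` class, its `commit` and `changes_since` methods, and the `trip-1`/`trip-2` keys are all hypothetical names invented for illustration. It assumes only that each record has a unique key and that every batch of changes is recorded as one commit on a timeline, so a downstream reader can ask for just what changed since the last commit it processed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Any]


@dataclass
class ToyTable:
    """Toy in-memory keyed table (hypothetical; not Hudi's storage format)."""

    # Current state of the table: record key -> row.
    rows: Dict[str, Row] = field(default_factory=dict)
    # Commit timeline: (commit_id, {record key -> new row, or None for a delete}).
    timeline: List[Tuple[int, Dict[str, Optional[Row]]]] = field(default_factory=list)

    def commit(
        self,
        upserts: Optional[Dict[str, Row]] = None,
        deletes: Optional[Iterable[str]] = None,
    ) -> int:
        """Apply one small batch of upserts and deletes as a single commit."""
        commit_id = len(self.timeline) + 1
        changes: Dict[str, Optional[Row]] = {}
        for key, row in (upserts or {}).items():
            self.rows[key] = row          # insert a new key or overwrite an existing one
            changes[key] = row
        for key in deletes or []:
            self.rows.pop(key, None)      # deleting a missing key is a no-op
            changes[key] = None           # None marks a delete on the timeline
        self.timeline.append((commit_id, changes))
        return commit_id

    def changes_since(self, commit_id: int) -> Dict[str, Optional[Row]]:
        """Return the latest change per key from commits after commit_id."""
        latest: Dict[str, Optional[Row]] = {}
        for cid, changes in self.timeline:
            if cid > commit_id:
                latest.update(changes)
        return latest


table = ToyTable()
c1 = table.commit(upserts={"trip-1": {"fare": 10.0}, "trip-2": {"fare": 7.5}})
c2 = table.commit(upserts={"trip-1": {"fare": 12.0}}, deletes=["trip-2"])

print(table.rows)               # current snapshot after both commits
print(table.changes_since(c1))  # only what changed after the first commit
```

A reader that last processed commit `c1` only has to look at the one update and one delete from `c2`, rather than rescanning the whole table. That is the intuition behind incremental processing; how Hudi actually implements it is what the episode covers.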
Census is the operational analytics platform that syncs your cloud warehouse with all the SaaS applications used by your Sales, Marketing & Success teams. If you need to get your company data into Salesforce, Marketo, Hubspot, Intercom, Zendesk, and other tools, Census is the easiest way to do so. Just write SQL (or plug in your dbt models), set up the sync frequencies, and voila, your data is now available to be used by all of your teams. No need to worry about incremental sync, backfilling, API quota management, API versioning, monitoring, and maintaining custom scripts. Just SQL. Start your free 14-day trial now.
Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!
RudderStack is the smart customer data pipeline. It takes the toil out of building data pipelines that connect your whole customer data stack. Its easy-to-use SDKs and source integrations, Cloud Extract integrations, transformations, and expansive library of destination and warehouse integrations make building customer data pipelines for both event streaming and cloud-to-warehouse ELT simple. RudderStack’s warehouse-first approach and Warehouse Actions functionality make your customer data stack smarter by enabling analysis and modeling in your data warehouse to trigger enrichment and activation in all of your customer tools. Start building smarter customer data pipelines today with RudderStack. Visit dataengineeringpodcast.com/rudder to learn more and sign up for our free tier, with no credit card required and no time limit.
- Hello and welcome to the Data Engineering Podcast, the show about modern data management.
- You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
- Your host is Tobias Macey and today I’m interviewing Vinoth Chandar about Apache Hudi, a data lake management layer for supporting fast and incremental updates to your tables.
- How did you get involved in the area of data management?
- Can you describe what Hudi is and the story behind it?
- What are the use cases that it is focused on supporting?
- There have been a number of alternative table formats introduced for data lakes recently. How does Hudi compare to projects like Iceberg, Delta Lake, Hive, etc.?
- Can you describe how Hudi is architected?
- How have the goals and design of Hudi changed or evolved since you first began working on it?
- If you were to start the whole project over today, what would you do differently?
- Can you talk through the lifecycle of a data record as it is ingested, compacted, and queried in a Hudi deployment?
- One of the capabilities that is interesting to explore is support for arbitrary record deletion. Can you talk through why this is a challenging operation in data lake architectures?
- How does Hudi make that a tractable problem?
- What are the data platform components that are needed to support an installation of Hudi?
- What is involved in migrating an existing data lake to use Hudi?
- How would someone approach supporting heterogeneous table formats in their lake?
- As someone who has invested a lot of time in technologies for supporting data lakes, what are your thoughts on the tradeoffs of data lake vs data warehouse and the current trajectory of the ecosystem?
- What are the most interesting, innovative, or unexpected ways that you have seen Hudi used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Hudi?
- When is Hudi the wrong choice?
- What do you have planned for the future of Hudi?
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email firstname.lastname@example.org with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Hudi Docs
- Hudi Design & Architecture
- Incremental Processing
- CDC == Change Data Capture
- Oracle GoldenGate
- Iceberg Table Format
- Hive ACID
- Apache Kudu
- Delta Lake
- Optimistic Concurrency Control
- MVCC == Multi-Version Concurrency Control