Insights And Advice On Building A Data Lake Platform From Someone Who Learned The Hard Way

May 15th, 2022

58 mins 10 secs

About this Episode

Summary

Designing a data platform is a complex and iterative undertaking that requires accounting for many conflicting needs. Designing a platform that relies on a data lake as its central architectural tenet adds further layers of difficulty. Srivatsan Sridharan has had the opportunity to design, build, and run data lake platforms for both Yelp and Robinhood, with many valuable lessons learned from each experience. In this episode he shares his insights and advice on how to approach such an undertaking in your own organization.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at companies such as Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl.
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
  • Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
  • Your host is Tobias Macey and today I’m interviewing Srivatsan Sridharan about the technological, staffing, and design considerations for building a data platform

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what your experience has been with designing and implementing data platforms?
  • What are the elements that you have found to be common requirements across organizations and data characteristics?
  • What are the architectural elements that require the most detailed consideration based on organizational needs and data requirements?
  • How has the ecosystem for building maintainable and usable data lakes matured over the past few years?
    • What are the elements that are still cumbersome or intractable?
  • The streaming ecosystem has also gone through substantial changes over the past few years. What is your synopsis of the meaningful differences between today’s options and where we were ~6 years ago?
  • How did your experiences at Yelp inform your current architectural approach at Robinhood?
  • Can you describe your current platform architecture?
    • What are the primary capabilities that you are optimizing for?
  • What is your evaluation process for determining what components to use in your platform?
    • How do you approach the build vs. buy problem and quantify the tradeoffs?
  • What are the most interesting, innovative, or unexpected ways that you have seen your data systems used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on designing and implementing data platforms across your career?
  • When is a data lake architecture the wrong choice?
  • What do you have planned for the future of the data platform at Robinhood?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
