Managing The DoorDash Data Platform - Episode 176

Summary

The team at DoorDash has a complex set of optimization challenges to deal with using data that they collect from a multi-sided marketplace. In order to handle the volume and variety of information that they use to run and improve the business, the data team has to build a platform that analysts and data scientists can use in a self-service manner. In this episode the head of data platform for DoorDash, Sudhir Tonse, discusses the technologies that they are using, the approach that they take to adding new systems, and how they think about priorities for what to support for the whole company vs. what to leave as a specialized concern for a single team. This is a valuable look at how to manage a large and growing data platform that supports a variety of teams with varied and evolving needs.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Datafold gives you visibility and confidence in the quality of your analytical data with fast dataset diffing, profiling, column-level lineage, and intelligent anomaly detection. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI, so in a few minutes you can get from 0 to automated testing of your analytical code. Go to dataengineeringpodcast.com/datafold to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.


RudderStack is the smart customer data pipeline. It takes the toil out of building data pipelines that connect your whole customer data stack. Its easy-to-use SDKs and source integrations, Cloud Extract integrations, transformations, and expansive library of destination and warehouse integrations make building customer data pipelines for both event streaming and cloud-to-warehouse ELT simple. RudderStack’s warehouse-first approach and Warehouse Actions functionality make your customer data stack smarter by enabling analysis and modeling in your data warehouse to trigger enrichment and activation in all of your customer tools. Start building smarter customer data pipelines today with RudderStack. Visit dataengineeringpodcast.com/rudder to learn more and sign up for our no credit card required, no time limit free tier.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often take hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
  • RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
  • Your host is Tobias Macey and today I’m interviewing Sudhir Tonse about how the team at DoorDash designed their data platform

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving a quick overview of what you do at DoorDash?
    • What are some of the ways that data is used to power the business?
  • How has the pandemic affected the scale and volatility of the data that you are working with?
  • Can you describe the type(s) of data that you are working with?
    • What are the primary sources of data that you collect?
      • What secondary or third party sources of information do you rely on?
    • Can you give an overview of the collection process for that data?
  • In selecting the technologies for the various components in your data stack, what are the primary factors that you consider when evaluating the build vs. buy decision?
  • In your recent post about how you are scaling the capabilities and capacity of your data platform you mentioned the concept of maintaining a "paved path" of supported technologies to simplify integration across teams. What are the technologies that you use and rely on for the "paved path"?
  • How are you managing quality and consistency of your data across its lifecycle?
    • What are some of the specific data quality solutions that you have integrated into the platform and "paved path"?
  • What are some of the technologies that were used early on at DoorDash that failed to keep up as the business scaled?
    • How do you manage the migration path for adopting new technologies or techniques?
  • In the same post you mentioned the tendency to allow for building point solutions before deciding whether to turn a given use case into a generalized platform capability. Can you give some examples of cases where a point solution remains a one-off versus when it needs to be expanded into a widely used component?
  • How do you identify and track cost factors in the data platform?
    • What do you do with that information?
  • What is your approach for identifying and measuring useful OKRs (Objectives and Key Results)?
    • How do you quantify potentially subjective metrics such as reliability and quality?
  • How have you designed the organizational structure for your data teams?
    • What are the responsibilities and organizational interfaces for data engineers within the company?
    • How have the organizational structures/patterns shifted or changed at different levels of scale/maturity for the business?
  • What are some of the most interesting, useful, unexpected, or challenging lessons that you have learned during your time as a data professional at DoorDash?
  • What are some of the upcoming projects or changes that you anticipate in the near to medium future?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
