Designing And Building Data Platforms As A Product - Episode 218

Summary

The term "data platform" gets thrown around a lot, but have you stopped to think about what it actually means for you and your organization? In this episode Lior Gavish, Lior Solomon, and Atul Gupte share their view of what it means to have a data platform, discuss their experiences building them at various companies, and provide advice on how to treat them like a software product. This is a valuable conversation about how to approach the work of selecting the tools that you use to power your data systems and considerations for how they can be woven together for a unified experience across your various stakeholders.

Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right data set? Or tried to understand what a column name means?

Our friends at Atlan started out as a data team, faced all of this collaboration chaos firsthand, and began building Atlan as an internal tool. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more.

Go to dataengineeringpodcast.com/atlan and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.


Hightouch is the leading Reverse ETL platform. Your data warehouse is your source of truth for customer data. Hightouch syncs this data to the tools that your business teams rely on. Hightouch has a catalog of flexible destinations including Salesforce, HubSpot, Zendesk, NetSuite, and ad platforms like Facebook or Google. Hightouch is built for data engineers and is a natural extension to the modern data stack with out-of-the-box integrations with your favorite tools like dbt, Fivetran, Airflow, Slack, PagerDuty, and DataDog.

It’s simple — connect your data warehouse, paste a SQL query, and use our visual mapper to specify how data should appear in downstream tools. No scripts, just SQL. Get started for free at dataengineeringpodcast.com/hightouch


Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
  • Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
  • Your host is Tobias Macey and today I’m interviewing Lior Gavish, Lior Solomon, and Atul Gupte about the technical, social, and architectural aspects of building your data platform as a product for your internal customers

Interview

  • Introduction
  • How did you get involved in the area of data management? – all
  • Can we start by establishing a definition of "data platform" for the purpose of this conversation?
  • Who are the stakeholders in a data platform?
    • Where does the responsibility lie for creating and maintaining ("owning") the platform?
  • What are some of the technical and organizational constraints that are likely to factor into the design and execution of the platform?
  • What is the minimum set of requirements necessary to qualify as a platform? (as opposed to a collection of discrete components)
    • What are the additional capabilities that should be in place to simplify the use and maintenance of the platform?
  • How are data platforms managed? Are they managed by technical teams, product managers, etc.? What is the profile for a data product manager? – Atul G.
  • How do you set SLIs / SLOs with your data platform team when you don’t have clear metrics you’re tracking? – Lior S.
  • There has been a lot of conversation recently about different interpretations of the "modern data stack". For a team who is just starting to build out their platform, how much credence should they be giving to those debates?
    • What are the first steps that you recommend for those practitioners?
    • If an organization already has infrastructure in place for data/analytics, how might they think about building or buying their way toward a well integrated platform?
  • Once a platform is established, what are some challenges that teams should anticipate in scaling the platform?
    • Which axes of scale have you found to be most difficult to manage? (scale of infrastructure capacity, scale of organizational/technical complexity, scale of usage, etc.)
    • Do we think the "data platform" is a skill set? How do we split up the role of the platform? Is there one for real-time? Is there one for ETLs?
    • How do you handle the quality and reliability of the data powering your solution?
  • What are helpful techniques that you have used for collecting, prioritizing, and managing feature requests?
    • How do you justify the budget and resources for your data platform?
    • How do you measure the success of a data platform?
  • What is the relationship between a data platform and data products?
  • Are there any other companies you admire when it comes to building robust, scalable data architecture?
  • What are the most interesting, innovative, or unexpected ways that you have seen data platforms used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while building and operating a data platform?
  • When is a data platform the wrong choice? (as opposed to buying an integrated solution, etc.)
  • What are the industry trends that you are monitoring/excited for in the space of data platforms?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
