Revisiting The Technical And Social Benefits Of The Data Mesh - Episode 250

Summary

The data mesh is a thesis that was presented to address the technical and organizational challenges that businesses face in managing their analytical workflows at scale. Zhamak Dehghani introduced the concepts behind this architectural patterns in 2019, and since then it has been gaining popularity with many companies adopting some version of it in their systems. In this episode Zhamak re-joins the show to discuss the real world benefits that have been seen, the lessons that she has learned while working with her clients and the community, and her vision for the future of the data mesh.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Datafold LogoDatafold helps you deal with data quality in your pull request. It provides automated regression testing throughout your schema and pipelines so you can address quality issues before they affect production. No more shipping and praying, you can now know exactly what will change in your database ahead of time.

Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI, so in a few minutes you can get from 0 to automated testing of your analytical code. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.


Atlan LogoHave you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means?

Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos themselves, and started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more.

Go to dataengineeringpodcast.com/atlan and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
  • Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
  • Your host is Tobias Macey and today I’m welcoming back Zhamak Dehghani to talk about her work on the data mesh book and the lessons learned over the past 2 years

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving a brief recap of the principles of the data mesh and the story behind it?
  • How has your view of the principles of the data mesh changed since our conversation in July of 2019?
  • What are some of the ways that your work on the data mesh book influenced your thinking on the practical elements of implementing a data mesh?
  • What do you view as the as-yet-unknown elements of the technical and social design constructs that are needed for a sustainable data mesh implementation?
  • In the opening of your book you state that "Data Mesh is a new approach in sourcing, managing, and accessing data for analytical use cases at scale". As with everything, scale is subjective, but what are some of the heuristics that you rely on for determining when a data mesh is an appropriate solution?
  • What are some of the ways that data mesh concepts manifest at the boundaries of organizations?
  • While the idea of federated access to data product quanta reduces the amount of coordination necessary at the organizational level, it raises the spectre of more complex logic required for consumers of multiple quanta. How can data mesh implementations mitigate the impact of this problem?
  • What are some of the technical components that you have found to be best suited to the implementation of data elements within a mesh?
  • What are the technological components that are still missing for a mesh-native data platform?
  • How should an organization that wishes to implement a mesh style architecture think about the roles and skills that they will need on staff?
    • How can vendors factor into the solution?
  • What is the role of application developers in a data mesh ecosystem and how do they need to change their thinking around the interfaces that they provide in their products?
  • What are the most interesting, innovative, or unexpected ways that you have seen data mesh principles used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data mesh implementations?
  • When is a data mesh the wrong approach?
  • What do you think the future of the data mesh will look like?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Liked it? Take a second to support the Data Engineering Podcast on Patreon!