Build Confidence In Your Data Platform With Schema Compatibility Reports That Span Systems And Domains Using Schemata

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

12 September 2022

Build Confidence In Your Data Platform With Schema Compatibility Reports That Span Systems And Domains Using Schemata - E324

Summary

Data engineering systems are complex and interconnected, with myriad and often opaque chains of dependencies. As they scale, the problems of visibility and dependency management can increase at an exponential rate. One approach to making this a tractable problem is to define and enforce contracts between producers and consumers of data. Ananth Packkildurai created Schemata as a way to make the creation of schema contracts a lightweight process, allowing the dependency chains to be constructed and evolved iteratively and integrating validation of changes into standard delivery systems. In this episode he shares the design of the project and how it fits into your development practices.
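
To make the idea of a producer/consumer schema contract concrete, here is a minimal sketch of the kind of compatibility check that could run in a delivery pipeline. It is not Schemata's API: the `Field` class, the `breaking_changes` function, and the example field names are all hypothetical, and the only rule assumed is that removing a field or changing its type breaks downstream consumers.

```python
# Hypothetical illustration only; this is NOT Schemata's API.
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    type: str


def breaking_changes(old: list[Field], new: list[Field]) -> list[str]:
    """List changes in `new` that could break consumers relying on `old`."""
    new_by_name = {f.name: f for f in new}
    problems = []
    for f in old:
        current = new_by_name.get(f.name)
        if current is None:
            # Consumers reading this field would no longer find it.
            problems.append(f"removed field: {f.name}")
        elif current.type != f.type:
            # Consumers may fail to parse or may misinterpret the value.
            problems.append(f"type changed: {f.name} {f.type} -> {current.type}")
    # Fields that exist only in `new` are treated as safe additions.
    return problems


# Example contract (made-up fields) and a proposed change to it.
published = [Field("user_id", "string"), Field("amount", "double")]
proposed = [Field("user_id", "int"), Field("currency", "string")]

for problem in breaking_changes(published, proposed):
    print(problem)
```

In a continuous integration job, a non-empty result could be used to fail the build before the change reaches consumers.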

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Your host is Tobias Macey and today I’m interviewing Ananth Packkildurai about Schemata, a modelling framework for decentralised domain-driven ownership of data.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Schemata is and the story behind it?
- How does the garbage in/garbage out problem manifest in data warehouse/data lake environments?
What are the different places in a data system that schema definitions need to be established?
- What are the different ways that schema management gets complicated across those various points of interaction?
Can you walk me through the end-to-end flow of how Schemata integrates with engineering practices across an organization’s data lifecycle?
- How does the use of Schemata help with capturing and propagating context that would otherwise be lost or siloed?
How is the Schemata utility implemented?
- What are some of the design and scope questions that you had to work through while developing Schemata?
What is the broad vision that you have for Schemata and its impact on data practices?
How are you balancing the need for flexibility/adaptability with the desire for ease of adoption and quick wins?
The core of the utility is the generation of structured messages. How are those messages propagated, stored, and analyzed?
What are the pieces of Schemata and its usage that are still undefined?
What are the most interesting, innovative, or unexpected ways that you have seen Schemata used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Schemata?
When is Schemata the wrong choice?
What do you have planned for the future of Schemata?

Contact Info

ananthdurai on GitHub
@ananthdurai on Twitter
LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.