Low Friction Data Governance With Immuta - Episode 164


Data governance is a term that encompasses a wide range of responsibilities, both technical and process oriented. One of the more complex aspects is that of access control to the data assets that an organization is responsible for managing. The team at Immuta has built a platform that aims to tackle that problem in a flexible and maintainable fashion so that data teams can easily integrate authorization, data masking, and privacy enhancing technologies into their data infrastructure. In this episode Steve Touw and Stephen Bailey share what they have built at Immuta, how it is implemented, and how it streamlines the workflow for everyone involved in working with sensitive data. If you are starting down the path of implementing a data governance strategy then this episode will provide a great overview of what is involved.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!

Struggling with broken pipelines? Stale dashboards? Missing data?

If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform!

Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today!

Visit http://www.dataengineeringpodcast.com/montecarlo to learn more.


Feature flagging is a simple concept that enables you to ship features faster, test in production, and do easy rollbacks without redeploying code. Teams using feature flags release new features with less risk, and release more often. Developers using feature flags need to merge less.

This episode is sponsored by ConfigCat.

  • Easily use flags in your code with ConfigCat libraries for Python and 9 other platforms.
  • Toggle your feature flags on the visual dashboard.
  • Hide or expose features in your application without redeploying code.
  • Set targeting rules to allow you to control who has access to new features.

ConfigCat allows you to get features out faster, test in production, and do easy rollbacks.

With ConfigCat’s simple API and clear documentation, you’ll have your initial proof of concept up and running in minutes. Train new team members in minutes also, and you don’t pay extra for team size. With the simple UI, the whole team can use it effectively.

Whether you are an individual or team you can try it out with their forever free plan. Or get 35% off any paid plan with code DATAENGINEERING

Release features faster with less risk with ConfigCat. Check them out today at dataengineeringpodcast.com/configcat


  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Feature flagging is a simple concept that enables you to ship faster, test in production, and do easy rollbacks without redeploying code. Teams using feature flags release new software with less risk, and release more often. ConfigCat is a feature flag service that lets you easily add flags to your Python code, as well as 9 other platforms. By adopting ConfigCat you and your manager can track and toggle your feature flags, including granular targeting rules, from their visual dashboard without redeploying any code or configuration. You can roll out new features to a subset of your users for beta testing or canary deployments. With their simple API, clear documentation, and pricing that is independent of your team size you can get your first feature flags added in minutes without breaking the bank. Go to dataengineeringpodcast.com/configcat today to get 35% off any paid plan with code DATAENGINEERING or try out their free forever plan.
  • You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess their impact through lineage, and notify those who need to know before they impact the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!
  • Your host is Tobias Macey and today I’m interviewing Steve Touw and Stephen Bailey about Immuta and how they work to automate data governance


  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what you have built at Immuta and your motivation for starting the company?
  • What is data governance?
    • How much of data governance can be solved with technology and how much is a matter of process and communication?
  • What does the current landscape of data governance solutions look like?
    • What are the motivating factors that would lead someone to choose Immuta as a component of their data governance strategy?
  • How does Immuta integrate with the broader ecosystem of data tools and platforms?
    • What other workflows or activities are necessary outside of Immuta to ensure a comprehensive governance/compliance strategy?
  • What are some of the common blind spots when it comes to data governance?
  • How is the Immuta platform architected?
    • How have the design and goals of the system evolved since you first started building it?
  • What is involved in adopting Immuta for an existing data platform?
    • Once an organization has integrated Immuta, what are the workflows for the different stakeholders of the data?
  • What are the biggest challenges in automated discovery/identification of sensitive data?
    • How does the evolution of what qualifies as sensitive complicate those efforts?
  • How do you approach the challenge of providing a unified interface for access control and auditing across different systems (e.g. BigQuery, Snowflake, Redshift, etc.)?
  • What are the complexities that creep into data masking?
    • What are some alternatives for obfuscating and managing access to sensitive information?
  • How do you handle managing access control/masking/tagging for derived data sets?
  • What are some of the most interesting, unexpected, or challenging lessons that you have learned while building Immuta?
  • When is Immuta the wrong choice?
  • What do you have planned for the future of the platform and business?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat


The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
