In this episode of the Data Engineering Podcast Matt Topper, president of UberEther, talks about the complex challenge of identity, credentials, and access control in modern data platforms. With the shift to composable ecosystems, integration burdens have exploded, fracturing governance and auditability across warehouses, lakes, files, vector stores, and streaming systems. Matt shares practical solutions, including propagating user identity via JWTs, externalizing policy with engines like OPA/Rego and Cedar, and using database proxies for native row/column security. He also explores catalog-driven governance, lineage-based label propagation, and OpenTDF for binding policies to data objects. The conversation covers machine-to-machine access, short-lived credentials, workload identity, and constraining access by interface choke points, as well as lessons from Zanzibar-style policy models and the human side of enforcement. Matt emphasizes the need for trust composition - unifying provenance, policy, and identity context - to answer questions about data access, usage, and intent across the entire data path.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
- Your host is Tobias Macey and today I'm interviewing Matt Topper about the challenges of managing identity and access controls in the context of data systems
- Introduction
- How did you get involved in the area of data management?
- The data ecosystem is a uniquely challenging space for creating and enforcing technical controls for identity and access control. What are the key considerations for designing a strategy for addressing those challenges?
- For data access, the off-the-shelf options are typically at either extreme of too coarse or too granular in their capabilities. What do you see as the major factors that contribute to that situation?
- Data governance policies are often used as the primary means of identifying what data can be accessed by whom, but translating that into enforceable constraints is often left as a secondary exercise. How can we as an industry make that a more manageable and sustainable practice?
- How can the audit trails that are generated by data systems be used to inform the technical controls for identity and access?
- How can the foundational technologies of our data platforms be improved to make identity and authz a more composable primitive?
- How does the introduction of streaming/real-time data ingest and delivery complicate the challenges of security controls?
- What are the most interesting, innovative, or unexpected ways that you have seen data teams address ICAM?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on ICAM?
- What are the aspects of ICAM in data systems that you are paying close attention to?
- What are your predictions for the industry adoption or enforcement of those controls?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- UberEther
- JWT == JSON Web Token
- OPA == Open Policy Agent
- Rego
- PingIdentity
- Okta
- Microsoft Entra
- SAML == Security Assertion Markup Language
- OAuth
- OIDC == OpenID Connect
- IDP == Identity Provider
- Kubernetes
- Istio
- Amazon CEDAR policy language
- AWS IAM
- PII == Personally Identifiable Information
- CISO == Chief Information Security Officer
- OpenTDF
- OpenFGA
- Google Zanzibar
- Risk Management Framework
- Model Context Protocol
- Google Data Project
- TPM == Trusted Platform Module
- PKI == Public Key Infrastructure
- Passkeys
- DuckLake
- Accumulo
- JDBC
- OpenBao
- Hashicorp Vault
- LDAP
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data teams everywhere face the same problem. They're forcing ML models, streaming data, and real time processing through orchestration tools built for simple ETL. The result, inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed, flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high memory machines or distributed compute.
Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI engineering, streaming, Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workloads, see what it can do for you at dataengineeringpodcast.com/prefect. Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy? DataFold's migration agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure perfect parity between your old and new systems.
Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multisystem migrations, they deliver production ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they turn months long migration nightmares into week long success stories. Your host is Tobias Macey, and today I'm interviewing Matt Topper about the challenges of managing identity and access controls in the context of data systems. So, Matt, can you start by introducing yourself?
[00:02:05] Matt Topper:
Yeah. Sure. So Matt Topper, president of UberEther. Been trying to solve a little bit of this world of identity and data coming together for, sadly, as we were joking, about thirty years. And, obviously, I'm not very good at my job because I think it's a mess for everyone still. So
[00:02:24] Tobias Macey:
And do you remember how you first got started working in the data space?
[00:02:28] Matt Topper:
Yeah. So I was fortunate enough right out of high school. My dad had called one of his old mentees that he used to work with and said, hey. My kid needs something to do over the summer before he starts college. You got it. And, luckily enough, I got thrown into a company that had all of the vehicle, you know, registrations for the entire world. And thirty years ago, those were some of the biggest databases before kind of the Internet started gobbling every move we make and ran into a whole bunch of great data problems there of sales records being aggregated across every major manufacturer. Right after a car is sold, the manufacturer has no idea who uses it, so we aggregated all the data across essentially every state, every country, and then kind of sold that back. That turned into the problem of everywhere I went, people were like, don't do the security thing. That's where you gotta get fired if you make a mistake. That's awesome. I wanna do that for the rest of my life. And, since then, I've been trying to get all of this working together.
[00:03:34] Tobias Macey:
As you mentioned, despite having been working in this space for decades and it not being a solved problem and in some cases, maybe even becoming worse, I think a big part of that too is that at least in the context of data systems, for some brief period of the history of working with data, the predominant means of dealing with this challenge was to buy some sort of vertically integrated stack from an enterprise vendor. And, of course, they had handled all of the identity and access control challenges to some approximation. But now we're in the much broader and more composable ecosystem of bring your own solution for x and then stick it all together with whatever sort of superglue you can find.
And, of course, every system wants to own their own piece of that identity and access challenge, and none of them do it in the same way. So now we're stuck with an even bigger integration problem. And I'm just wondering if you can just talk to some of the ways that the data ecosystem is uniquely challenging for this problem of access control and identity management and auditability beyond just the standard, in air quotes, case of managing identity for a single application.
[00:04:52] Matt Topper:
Yeah. I was lucky enough, and I'm sure a lot of people on the podcast, when I say the word Oracle, either cringe or cheer with joy. I don't think there's a middle in that, but I was part of the Oracle team that worked on some of their original label security pieces. That got into some interesting spaces when we worked in the federal government, because we would be sharing data across unclassified, secret, and top secret networks, right? The same user on those different networks would be able to see different things based on their classification level, and then within that, federating out to other countries accessing that data across the world.
And we ended up having to extend the JDBC connectors at that time to say, okay, give me the Java thread in the container where the user identity was being persisted, and pass that as part of the label security. We would literally pick the thread out of the pool, assert the user identity on that thread, and then that would allow us to filter in label security what they can and can't see. But, obviously, that was super specific to Oracle. And luckily enough, we were on the inside, so we could modify drivers, we could modify the back end, and that doesn't really exist across the ecosystem today. Some of the things I'm seeing starting is that some of the databases have started to accept JWT tokens.
So you're able to actually start passing from the user's browser, if it's in that case, or the service account, to the API, to the next API, and chain the identities through to say, I know the paths that this data is coming in through. And the more you can look towards those, the better off you're going to be. Right? Because that gives you, at every hop along the way, who the user is, what the service accounts they're going through are, maybe even some of the context of where in the world they are, and then at least use the JWT to call out, get the person's roles, and then apply that to any policies you have in the database. Obviously, your data integration layer may have its own way of doing that, which gets into, as you said, right, they all have their own policy rules and their own special languages for that.
Where I've seen some successes, there's a project called OPA, Open Policy Agent. Rego is the policy language, and OPA is kind of the enforcement point. And I've seen, not built into a lot of the databases or data lakes yet, but an ability for basically a proxy in front of that to take those calls and say, hey, based on this user and this path, what are they allowed to query or not? And it'll return a list of either roles or different tags you've got on the data that you can then apply to your where clauses dynamically. And that's been quite successful in some of the orgs we've worked with.
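As a rough illustration of the proxy pattern Matt describes, here is a minimal Python sketch that asks a locally running OPA instance for row-level predicates and appends them to a query's WHERE clause. The policy package path and the shape of OPA's result (a list of SQL predicates) are assumptions for the example, not a standard contract.

```python
import requests

OPA_URL = "http://localhost:8181/v1/data/datalake/authz/row_filters"  # assumed policy path

def fetch_row_filters(user: str, table: str, token: str) -> list[str]:
    """Ask OPA which row-level predicates apply for this user and table.

    The result shape (a list of SQL predicate strings) is an illustrative
    convention for this sketch, defined by the Rego policy you write.
    """
    resp = requests.post(OPA_URL, json={"input": {"user": user, "table": table, "jwt": token}})
    resp.raise_for_status()
    return resp.json().get("result", [])

def apply_filters(base_query: str, filters: list[str]) -> str:
    """Append OPA-derived predicates to the query before forwarding it."""
    if not filters:
        return base_query
    predicate = " AND ".join(f"({f})" for f in filters)
    joiner = " AND " if " where " in base_query.lower() else " WHERE "
    return base_query + joiner + predicate

# Example of what a proxy might do for every query it forwards to the warehouse:
# query = apply_filters("SELECT * FROM orders", fetch_row_filters("matt", "orders", jwt_token))
```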
[00:07:48] Tobias Macey:
So you've mentioned a few different technologies and protocols that have been emerging largely as an outgrowth of the investment in cloud native architectures and the Kubernetes ecosystem and the growth of microservices for better or worse.
[00:08:05] Matt Topper:
And that's an interesting one. Can I be that opinionated?
[00:08:11] Tobias Macey:
And so that's a set of techniques and protocols that have grown up in one ecosystem, and it's starting to filter into the space of data platforms, I think largely because the data engineering role has grown to overlap with the platform engineering role as we gain some levels of sophistication in how to deal with these increasingly complex systems. And before we get too much further down that road, I think it's probably worth moving up to a little bit of a higher level first and just enumerating what we mean when we talk about these questions of identity and credentials and access management and some of the different ways that they manifest, particularly from the end user perspective of, I just wanna be able to access the data. But then from the organizational perspective, we need to be able to say, okay. You can see these attributes of the data, but not these other ones because of regulatory or data privacy considerations.
[00:09:09] Matt Topper:
Yeah. So taking that step back, right, normally, inside of a browser, you're gonna have some sort of authentication event, which normally will go to an identity provider. So most organizations have that single sign on, single login page. A lot of times, that's gonna be Ping Identity, Okta, Microsoft Entra ID, or your Google Workspace. And, right at that point, here's the authentication event. Here's how strong that authenticator was. Because sometimes you might say, hey. If you're using username and password, you're not getting the golden goose. Right? You're not getting the good data that we have because that's so easily compromised from a username and password perspective. So once you log in there, most of the time, you're gonna get either a SAML assertion for the app or you're gonna get a JSON web token through OAuth or OpenID Connect. So that's what your front door application is gonna see, and it's gonna be either a unique identifier that says go back to the IDP and look it up and tell me who this person is, which is my preferred method because we all know the browser is the most easily compromised thing in the entire path. And, right, at that point, that token gets signed to the service, and then the service has to present their key to say, hey. Who's Matt? Or who is this? And then the service will say it's Matt. So now you have two places you need to compromise to get that stolen. So, right, the service now knows who the person is.
Well, at that point, a lot of times in these Kubernetes ecosystems, that web application then has to call an API. Right? And that web application is gonna get a short term certificate if they're using something like Istio. But at the end of the day, it's the credential for a service account. So a lot of organizations will either embed username and passwords, aka OAuth tokens and secrets, to call those APIs. And then with that, call back to the IDP as a security token service and say, hey. I'm now making a call to this API on behalf of this user. So now we've got a chain of those. And finally, after that, that chain continues down the path to get to the database. And, right, a lot of times, the database doesn't even get to see that, and that's where I'm, as I think you said, seeing more of the JWT tokens being able to be accepted at the data layer. So you can now say, okay. I know this has come through a trusted path. There's no way that that data's been compromised.
And now based on the user, where they're coming from, the services they're accessing, I know that I can only release this data or modify my where clauses, my queries this way or another.
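To make the identity chaining concrete, here is a hedged Python sketch of one way a mid-tier API could verify an incoming user token and trade it for a downstream-scoped token using the OAuth 2.0 token exchange grant (RFC 8693). The IdP URLs, client credentials, and audience values are placeholders, and the exact claims your identity provider expects will differ.

```python
import jwt        # PyJWT
import requests

TOKEN_ENDPOINT = "https://idp.example.com/oauth2/token"  # assumed security token service endpoint

def exchange_for_downstream_token(incoming_jwt: str, audience: str) -> str:
    """RFC 8693 token exchange: trade the caller's token for one scoped to the
    next hop (e.g. the database), so the user's identity travels with the call.
    The client id/secret here stand in for the workload's own credential."""
    resp = requests.post(
        TOKEN_ENDPOINT,
        data={
            "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
            "subject_token": incoming_jwt,
            "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
            "audience": audience,
        },
        auth=("api-service-client-id", "api-service-secret"),  # placeholder workload credential
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def identity_from_token(token: str, public_key: str) -> dict:
    """Verify signature and audience before trusting the asserted user identity."""
    return jwt.decode(token, public_key, algorithms=["RS256"], audience="analytics-db")
```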
[00:11:53] Tobias Macey:
The other interesting wrinkle in this space that has been growing in the past few years is that databases aren't even necessarily the storage and retrieval mechanism for the information and the access, particularly as we move more into this AI era where everything is either files on disk in the cloud somewhere or an s three bucket or some sort of vector store. And so now we have to be able to manage access controls at a larger number of touch points where, for a while, you could just say, as long as my database controls are sufficient, then I can feel good about people being able to access the things they need to access. Now maybe you have a SQL interface using something like Trino or some other federated query engine, but maybe you're sending people directly to those files in an s three bucket. And so then you need to have those same types of access control available in an environment where maybe it doesn't even understand anything about the column structure, but you still need to be able to manage some sort of masking or encryption to make sure that only people who are permitted to have access to address data or grade information, etcetera, can actually see it.
[00:13:06] Matt Topper:
Yeah. And that's, like, I can say on the identity community side, having lived in the identity community and the data community for so long, there's a lot of parallels in the same problems we all have. Right? Whereas we're not always embedded into these application teams that are designing the solutions, and they're just saying, I need access to this thing. And the biggest thing is there is no good global shared language for access control at this point. So, right, you could say, here's coarse-grained: you've got role level access.
But at the end of the day, in the data world, those are always too simple. And at the end of the day, that then becomes too insecure, because it's like you get access to that bucket on s three or that folder in that bucket on s three, but it doesn't say, oh, in that parquet table, here's the columns, and within those columns, you only get where values equal x, y, and z, which then gets you into the fine level row and column controls. But they're completely unmanageable at scale. Right? Every single time you've got a new bucket, a new folder, a new file, you're now having to write all these super fine grained policies. And we're starting to see more success, and I'll say it's not great right now, but it's kinda one of these things where we've gotta figure out a way to organically get there, which is the policy abstraction layer. So in the identity world we live in, there's, right, a policy enforcement point, which, right, just say an s three bucket is the front door of who can access what files inside of there, and then a policy decision point, which is essentially abstracted from that, that then says, okay, based on the user and the folder they're looking at, what are now more fine grained controls to look at? But then it's not embedded into all of the apps and embedded into all of the code. And that's where, traditionally, we've seen things like the XACML language, which is incredibly complex, incredibly hard to write to.
The admin interfaces to write policies are, in most cases, garbage. And then lately, we've seen the growing up of Rego and OPA, as well as Amazon coming out with a policy language called Cedar that has actually grown quite well and is very good and focused at the data problem. So I know a couple of the members on that team, and I believe Snowflake is adopting Cedar, if I remember correctly. But it's been a little while since I looked at it. But that'll at least give us an ability to say, hey, here's a set of rules for GDPR or for HIPAA policies that says, here's what's gotta be encrypted, here's how it's gotta be encrypted, here's how it can be released or not released. And then you can start changing those policies in one place and reapply them to the data throughout your entire ecosystem versus trying to have to craft these one off solutions.
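The enforcement point / decision point split Matt sketches can be illustrated without any particular policy language. The following toy Python decision point keeps rules in one place so an enforcement point (an S3 proxy, a query engine, an API gateway) only asks for an allow or deny; the attribute and tag names are invented for this example and are not Cedar or Rego syntax.

```python
from dataclasses import dataclass

@dataclass
class Request:
    user: str
    clearance: set[str]      # attributes the caller holds, e.g. {"pii"}
    resource_tags: set[str]  # tags on the column/file, e.g. {"pii", "restricted"}
    action: str              # "read", "export", ...

# Each policy is (description, predicate). Changing a rule here re-applies it everywhere
# the PDP is consulted, instead of being re-coded in every app.
POLICIES = [
    ("PII requires pii clearance",
     lambda r: "pii" not in r.resource_tags or "pii" in r.clearance),
    ("Restricted data may never be exported",
     lambda r: not (r.action == "export" and "restricted" in r.resource_tags)),
]

def decide(request: Request) -> bool:
    """Policy decision point: allow only if every policy predicate holds."""
    return all(pred(request) for _, pred in POLICIES)

# A bucket front door (the PEP) would call:
# decide(Request("matt", {"pii"}, {"pii"}, "read"))  -> True
# decide(Request("matt", set(), {"pii"}, "read"))    -> False
```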
[00:16:10] Tobias Macey:
The other interesting piece, as you said, is we have these abstraction layers where we have the Rego policy language. I know that there's the Oso project that has their Polar language. You mentioned Cedar. And so that is one integration point that we can use to move more of the common logic to act as that multiplexer for the problems that we're trying to solve. But then you end up in a situation where maybe you're using a tool or a platform that doesn't have an easy integration point to be able to act as that translation, or it's a vendor managed solution and you just get whatever they wanna give you. And so then you're in a situation of having to either build your own glue or just throwing up your hands and saying, I give up. And so we have this problem of we either have too much granularity and control, and so it's an exercise in frustration just trying to even figure out what are the things that I need to say and do. AWS's IAM policy language is probably the best example of something that is too fine grained and too complex, and nobody ever knows for sure if what they're doing is actually right, including anecdotally people who work at Amazon and build the IAM system. Or you have something where it's too coarse grained, and it's either just a Boolean, you have access or you don't. Or maybe it's, to your point, you either have access to this bucket or this folder or you don't, but that doesn't give you the necessary level of consideration. And I'm just wondering how you're seeing people start to try and address this inherent complexity and the highly dimensional nature of this problem space where we want to be able to have attribute based access control where we can just have these conditional checks that filter all the way down, but either creating and propagating them is impossible or too complex.
[00:18:02] Matt Topper:
Yeah. Sadly, the only way to force it that I've found across the vendors, and, right, this is an impossible task for most data and identity engineers, is you have to force it on your procurement teams when these tools are being bought and say, unless you're supporting these standards to externalize the data or externalize the policies, we're just not gonna buy you. And, right, depending on the size of your organization, you do or don't have that level of push. And a lot of times, it's gotta be a lot of big organizations making that push together and working towards it. Because having worked for a lot of the large vendors or with them over time, this is one of the wonderful ways they love to lock you in and make sure you can't move to somebody else, because you are gonna spend a ton of time writing these policies to enforce your data, and then you go to move to another tool and somebody goes, oh, that's a three year project or even a six month project. Well, we don't have time, and then it just continues to proliferate out. At the end of the day, it's either that approach, which is seemingly more successful from what I've seen, or you're writing your own abstraction layers on top of these tools that do the enforcement and then just reflect the filtered queries back to the tools themselves, which then becomes a one off continuous nightmare to manage as well. There truly is no good solution I've found, and good luck selling that to executives and, right, getting the time to build those things too.
[00:19:33] Tobias Macey:
I think another aspect of the complexity is that even if you do have a policy language where you can say you have access to PII or you do not have access to PII, and being able to enforce that, it's not even always clear, one, what constitutes PII, because there are different levels and gradations of that, but also whether or not the thing that they're being granted access to contains PII, because that's not something that is always identified at the ingest layer, or even if it is identified at the ingest layer, maybe it's not propagated downstream. And so just being able to manage the identification of which data sources and which datasets and which attributes are the things that actually need to be protected, assuming that you have the necessary controls to enforce those protections.
And I'm just wondering how you see people try to solve that, particularly now that we're moving into the space of unstructured datasets where PII is an even more difficult challenge to identify and address.
[00:20:35] Matt Topper:
Yeah. I think a lot of that where I've seen organizations have success is it's the idea of the data governance layer above the data itself and how that flows through your organization and where it goes. So whether you're building that into your data orchestration layers or, I personally like data catalog systems. Right? There's some good open source tools out there for that that says, hey. Here's all the endpoints we've got for APIs across our organization. Here's whether that's data related or, right, service APIs on the other side. And here's all of the elements that are being returned. And then within that catalog, you can at least start as data engineers filtering down and going, why can't it say SSN?
Oh, that's the super secure node and not Social Security number. Right? But, like, by having a centralized catalog that you can look to and kinda run over, then, right, ideally, that catalog is also seeing your orchestration pipeline. You can start to see basically all the relationships between those pieces and tie them together. Right? There's some really advanced ways that this has been done. So I've worked in the US intelligence community for a long time. And, right, there's always this fun problem of, hey, we're friends with this group today, but we might be mortal enemies tomorrow. Or based on this conflict, we're friends to get this mission done, but as soon as that mission's over, we don't want them to have the data anymore. And there's been some interesting ways we've solved that of wrapping the data before it can be used and providing our own APIs around that and some extremely complex key management capabilities.
But most organizations don't have the time, money, or resources to do that, nor is it something that's even near a standard today. So, yeah, there's no good answer. Have you seen anything in stuff you've done?
[00:22:43] Tobias Macey:
Not any universally applicable method. So to your point, data catalogs, I think, are probably the most holistic way to do it, presuming that you have a means of being able to do some of that identification, whether in an automated fashion or even in a manual fashion. And then as long as it has a good lineage tracking capability, then you can use that to propagate those labels to downstream data assets and then be able to figure out where do I need to manage these access controls. So the tool that I'm working on integrating with and investing in is OpenMetadata, which has some of those capabilities. The other piece that I've seen some investment in, and, unfortunately, it's now behind a paywall, is the dbt Fusion engine, which was previously, I'm blanking on the acronym that they used as their name, but they were a team that got acquired by dbt. They were a Rust-based engine that would allow you to have more descriptive typing information on the columns for being able to say, this is PII, or this is PII of this sort, this is an address, this is a Social Security number, etcetera. And then through the dbt execution graph, be able to propagate those labels as well and manage some of that granularity of access control. But beyond having some system that understands the way that the data propagates and proliferates and allowing those labels to follow those different transformations, there's not really any other good way to do it other than just being very manual and draconian in your practices, or just giving up and saying, I hope nothing bad happens, which is obviously not the right answer.
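A minimal sketch of the lineage-based label propagation described here: tags applied at ingest are pushed to every downstream column over the lineage graph, so policies can key off the tag wherever the data lands. The table and column names are made up, and a real catalog such as OpenMetadata would manage this graph for you.

```python
from collections import defaultdict, deque

# Lineage edges: upstream column -> downstream columns it feeds (illustrative names).
LINEAGE = {
    "raw.users.ssn": ["staging.users.ssn"],
    "staging.users.ssn": ["mart.customer_360.ssn", "mart.risk_features.ssn_hash"],
}

def propagate_tags(seed_tags: dict[str, set[str]]) -> dict[str, set[str]]:
    """Push tags applied at ingest (e.g. {"pii"}) to every downstream column."""
    tags = defaultdict(set, {col: set(t) for col, t in seed_tags.items()})
    queue = deque(seed_tags)
    while queue:
        col = queue.popleft()
        for downstream in LINEAGE.get(col, []):
            if not tags[col] <= tags[downstream]:   # only revisit when new tags arrive
                tags[downstream] |= tags[col]
                queue.append(downstream)
    return dict(tags)

# propagate_tags({"raw.users.ssn": {"pii"}})
# -> the staging and mart columns inherit the "pii" tag, so access policies keyed on
#    that tag follow the data through each transformation.
```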
[00:24:27] Matt Topper:
I hope our CISO has enough of an insurance policy on their job, which, sadly, is something I see in CISO contracts now: hey, I've told you that there's no way we can protect this without you guys making the investment. The board's made a decision not to make that investment. So we're gonna put it into my CISO contract: you're gonna pay me for two years if we get breached. Like, that's sadly the point we're at with data security for many organizations. I did take a quick look. So I mentioned some of the things that have been done with the US intelligence community and some of the NATO partners we've got. There is an open format called OpenTDF, which is the Trusted Data Format. And I can say NATO is pushing a lot of the large vendors to start adopting this. So it is a way to cryptographically bind attribute based access control policies to data objects as they, right, are sitting as files inside of buckets. Right? Essentially, it's wrapping those files to then say, I need a trusted identity object to verify against before I'm gonna unlock this data for someone to access. Right? It's a part of the problem set.
It doesn't always get to the finest grain of controls, but it's at least a standard that some pretty large organizations with pretty deep pockets are pushing now, and I've seen it working quite well at scale.
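The following is not the OpenTDF SDK; it is a small Python illustration of the idea of binding an attribute-based policy to a data object by encrypting the payload and carrying a policy manifest with it. In a real TDF deployment the key is escrowed with a key access service that checks the caller's attributes, rather than traveling with the object as it does in this self-contained demo.

```python
from cryptography.fernet import Fernet

def wrap_object(payload: bytes, policy: dict) -> dict:
    """Bind an access policy to a data object: encrypt the payload and attach
    the policy as manifest metadata. The key_id is a placeholder reference."""
    key = Fernet.generate_key()
    return {
        "manifest": {"policy": policy, "key_id": "kas-key-123"},
        "ciphertext": Fernet(key).encrypt(payload).decode(),
        # The raw key would live in a key access service, not in the object;
        # it is embedded here only so the sketch runs on its own.
        "_demo_key": key.decode(),
    }

def unwrap_object(wrapped: dict, caller_attributes: set[str]) -> bytes:
    """Stand-in for the key-service check: release the payload only if the
    caller's attributes satisfy the policy bound to the object."""
    required = set(wrapped["manifest"]["policy"]["require_attributes"])
    if not required <= caller_attributes:
        raise PermissionError("caller does not satisfy the data object's policy")
    return Fernet(wrapped["_demo_key"].encode()).decrypt(wrapped["ciphertext"].encode())

# wrapped = wrap_object(b"name,ssn\n...", {"require_attributes": ["nato-mission-x"]})
# unwrap_object(wrapped, {"nato-mission-x"})   # succeeds
# unwrap_object(wrapped, set())                # raises PermissionError
```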
[00:26:02] Tobias Macey:
Composable data infrastructure is great until you spend all of your time gluing it back together. Bruin is an open source framework driven from the command line that makes integration a breeze. Write Python and SQL to handle the business logic and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end to end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform.
Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you a thousand dollar credit to migrate to Bruin Cloud. Yeah. Another project that I have encountered but haven't invested much time into researching or playing with is OpenFGA, which I think stands for fine-grained authorization and is based on Google's Zanzibar system. And that's another Cloud Native Computing Foundation project that has started to make inroads into the data ecosystem as well. So we are getting to a point where there are enough broadly compatible and open-for-integration layers that we can start to build some of that unified access control plane, but there's still a lot of work left to do. And then there's also still that problem of some of the systems that you're running are so old that they're never going to work with this, and some of them are so new that they're trying to build their own system. And so just trying to wrangle that complexity is an exercise in futility.
[00:27:37] Matt Topper:
Yeah. And at least when I've used the Zanzibar based systems in the past, they are very fine grained, and you have to have your data structured pretty tightly to be able to use them; essentially, you're defining policies of policies in Zanzibar in many cases. And you really do have to know, like, you have to have very, very set patterns for it to be successful. And a lot of times, you end up having to retag a lot of your data if your patterns change. And, right, depending on how big your datasets are, that can be extremely expensive to do.
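For readers unfamiliar with the Zanzibar model, here is a toy Python sketch of relationship tuples and a check function. It only handles direct tuples and one level of group indirection; real Zanzibar and OpenFGA add relation rewrites, consistency tokens, and much more, and the object and relation names here are illustrative.

```python
# Access is derived from (object, relation, subject) tuples rather than
# per-resource ACL code scattered through applications.
TUPLES = {
    ("dataset:sales_2024", "owner", "user:matt"),
    ("dataset:sales_2024", "reader", "group:analysts#member"),
    ("group:analysts", "member", "user:tobias"),
}

def check(obj: str, relation: str, user: str, depth: int = 5) -> bool:
    """Return True if the user holds the relation directly or via a group userset."""
    if depth == 0:
        return False
    if (obj, relation, user) in TUPLES:
        return True
    for o, r, subject in TUPLES:
        # A subject like "group:analysts#member" means: anyone who is a member
        # of group:analysts also holds this relation.
        if o == obj and r == relation and subject.endswith("#member"):
            group = subject.split("#")[0]
            if check(group, "member", user, depth - 1):
                return True
    return False

# check("dataset:sales_2024", "reader", "user:tobias") -> True, via group membership
# check("dataset:sales_2024", "owner", "user:tobias")  -> False
```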
[00:28:15] Tobias Macey:
The other challenge in this space is that for that governance protocol to be effective, you need to have acceptance and buy in from the organization to actually force people to adhere to it because, otherwise, they're going to say, but I need access to the data, and my boss said it's fine. And so then the data teams are just left in another situation of throwing up their hands and hoping for the best. But I'm just wondering whether, I guess, as we move to a space where we actually have more of that easily enforceable fine grained control, where you're not preventing people from doing their work just because they can't get access to the data or to a particular table because there's too much sensitive data in it, but they can see just the things that they need, maybe that's a way of reducing that point of contention and getting more of the organization on board with these access policies and control policies. And I think the other tension that I've seen too is that you can aggregate data and anonymize it that way, but then you have people saying, oh, but I need to be able to access every single row for purpose x, y, and z. So it's about managing some of that education and facilitating the needs of the organization without just being a gatekeeper for the sake of gatekeeping.
[00:29:34] Matt Topper:
Like, on the day-to-day side of the world, we run into the exact same thing, right, where everyone's like, well, I bought this tool, and it's got its own user management interface, and then we're just gonna do it ourselves. And you're just like, jeez. Right? Like, no, you need to get in a centralized governance system. No, you need to reuse these roles, because otherwise that's how all the data gets let out. So a lot of times, it just becomes the, hey, here's the interfaces we support, here's how you get access to it, and then it becomes part of, right, however your organization does governance of rolling out applications to production, right, risk management framework type stuff, and saying, hey.
Essentially, it has to be a monumental event to overcome the data team's policy of how you access these things. And you've gotta have CIOs and CTOs and CISOs with strong backbones that say, no. Fix your stuff, devs. We will miss this deadline because you didn't do it right. And, right, that comes to board pressure, but we have at least found that if you can have, kinda, what we call brown bags of, like, hey, every Thursday, we're gonna have five of our best data engineers. If you're rolling stuff out in our organization, they're available from noon to one every Thursday, and join this call if you've got questions about integrating with our platform. And then, right, that leaves the door open. And then when they get to rolling things out to the governance board, they can ask, did you show up to brown bags? Did you work through these things? And then at least the data teams and the app teams have made the effort to have the discussions, so the boards can say, no, you didn't do the thing, and push back that way. Alright? And then, obviously, it becomes someone's job to take those brown bags and turn them into FAQs, turn them into backlogs of, yeah, there is a gap here. We actually don't know how to solve this problem for this type of app or this type of new streaming interface that's coming in. And then, right, FAQs get updated, that type of stuff. But there's, again, never a silver bullet between either of these teams on either side.
[00:31:43] Tobias Macey:
And so, generally, we've been speaking from the perspective of human operators trying to get access to data to be able to perform some role in their job. But then there's also the other side of machine access and managing appropriate controls there where, generally, if it's machine to machine, we just say, oh, well, I just trust that the machine is going to do what I told it to do, so it's fine. As we get into more agentic and AI driven workloads, obviously, that gets even more fraught, but I'm just wondering how you're seeing teams either properly address, and maybe some of the ways that you're seeing them fail to address, some of those considerations of appropriate access control and role based definitions when there isn't necessarily a human in the loop or when there's a machine that is doing some data fetching that will eventually be displayed to a human.
[00:32:37] Matt Topper:
Yeah. I've said this a lot lately, especially with the AI. We're all in on the Model Context Protocol, and, right, it's obviously hitting the data community hard right now because everyone wants an MCP server to access the data behind the scenes. And I will say the Google project that they have for data access is really well done and thought out. If people haven't played with or looked at that yet, I can probably send a... oh, that's cute. My Google phone just picked up thinking I was asking it a question. But I can send you a link to that project that you can include in the show notes; it's very, very well done going to traditional data sources. But when Anthropic first put out that spec, the security section literally listed, whack, whack, to do.
Like, it was, yeah, we know this is a hard freaking problem. We're just gonna put out the spec and figure it out later. And that quickly became, like, some of the first patterns for MCPs: here's how to go into your browser and copy the cookie and give it to the MCP server to then access all of the data. Alright. I think both the data and the identity engineers are, like, having heart attacks at how terrible that is as a pattern. In general, what we like to see, and this did come out of the Kubernetes ecosystem but has very much been proven on traditional workloads, is to use more short lived credentials.
And using, really, attestation of the servers or the services that are running to then be able to grant those credentials. So even, right, in a Kubernetes environment, that container has to be admitted into the cluster. Well, if you can prove that that container's hash essentially matches the one that comes from your container registry and hasn't been modified, and it's running on a Kubernetes host that has a TPM on it, right, even a virtual TPM, which every cloud environment has, well, then I at least know that that container runs these workflows on trusted hardware that we know about and manage, and I can give it a credential. And then that credential is normally either a JWT token or a PKI certificate that'll rotate every x amount of hours.
So then from a data layer perspective, you can now say, hey, this is signed, asserted, and attested to the workload level, not just the device level, the workload that's running. And for an attacker, if they were to come in and steal that, they'd have to remain persistent there and pick up a new token every two, four, 24, whatever that policy is, hours, and that makes the attackers way easier to spot. It also makes it easier from a data engineering perspective to say, oh, that's tied to this workload, and they're only allowed to query these tables, these columns.
And if somebody tries to go outside of that, or they're trying to use one of those tokens from a machine that hasn't been attested, then it's way easier to spot those things from an attacker perspective. And that's sad, right? Same type of thing with pressures of getting things out the door. A lot of the problems I've seen is you've got these long lived tokens, and we have, right, pushed all of our users to get rid of their usernames and passwords and move to passkeys or FIDO tokens, or having all of our credentials put into a, right, privileged access management blocker where you have to go log in with your SSO, and then you get the username and password to copy into your app. I apologize on behalf of the entire identity industry for that pattern because it sucks. But all of that, like, we've changed that for the users. But at the end of the day, we've been giving our applications just another username and password, and maybe we just call it a token. But then instead of being limited to the individual human, it's the entire database.
And it's actually way more insecure and opens the doors to attackers much, much more easily. And that's what I've seen all of the pivots from an attack perspective be lately: right, it'll come in as a human, but immediately they'll pivot to finding shared credentials on services. So by using ephemeral credentials, limiting down how long they can live is forcing bad guys to get caught, and then also allows us from the data engineering side to better limit what things can be accessed. And then if the policies have to change, right, this is part of the Zanzibar side of things from Google, but also what is publicly known as Istio now. Right? Sometimes when you mint that token, you're gonna put those policies in the token and what returns true for those four hours.
So now when that gets to the data layer, you can use what's in that policy. And if the policy for what that data can access needs to change, well, you're gonna have a window where, yeah, for the next four hours, that token says the policy is x, y, and z. Well, in four hours, we know it's gonna change to be a, b, and c. So now you're not having to build additional things into your central data layer to then get the updates from the PDP; they come through the token on the front door, which actually may be an easier pattern for many people to implement.
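A hedged sketch of the short-lived workload credential pattern: the token carries its authorization (which tables the workload may query) as claims with a bounded expiry, so the data layer can enforce it without calling back to the policy engine until the token rotates. The signing key, claim names, and TTL are illustrative; a production system would use asymmetric keys issued off a workload attestation.

```python
import time
import jwt  # PyJWT

SIGNING_KEY = "workload-identity-demo-secret"  # placeholder; real deployments use attested, asymmetric keys

def mint_workload_token(workload: str, allowed_tables: list[str], ttl_hours: int = 4) -> str:
    """Mint a short-lived credential whose policy is embedded as claims."""
    now = int(time.time())
    claims = {
        "sub": workload,
        "iat": now,
        "exp": now + ttl_hours * 3600,          # forces rotation and limits stolen-token value
        "allowed_tables": allowed_tables,        # illustrative claim name
    }
    return jwt.encode(claims, SIGNING_KEY, algorithm="HS256")

def authorize_query(token: str, table: str) -> bool:
    """Data-layer check: reject expired tokens and tables outside the embedded policy."""
    try:
        claims = jwt.decode(token, SIGNING_KEY, algorithms=["HS256"])
    except jwt.ExpiredSignatureError:
        return False
    return table in claims.get("allowed_tables", [])

# token = mint_workload_token("fraud-scoring-job", ["transactions", "merchants"])
# authorize_query(token, "transactions") -> True
# authorize_query(token, "users")        -> False
```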
[00:38:16] Tobias Macey:
I think too that one of the bigger challenges in terms of data architecture is just figuring out what is the choke point or set of choke points that I'm going to rely on for being that policy enforcement layer because you can go all the way down to the metal and say, okay. I need to do my policy enforcement at the bits and bytes layer or at the layer of the individual files and how those are laid out. Or you can say, well, I'm only going to allow access to that underlying infrastructure through these three different interfaces where maybe I have my SQL layer for my data warehouse and my parquet files that are in s three buckets, etcetera.
And maybe I also allow for some sort of notebook system to be able to access them, but the only way that the notebook is going to be able to authenticate to those is either through that SQL layer or through AWS IAM or what have you. And then maybe the only other way that I'm going to allow for this data access is through some sort of controlled export mechanism that is managed by my orchestration system. And so if you want to be able to access this data, I'm going to be the one to hand it to you. You're not going to get it yourself.
[00:39:30] Matt Topper:
Yeah. I mean, it's just controlling those interfaces and not allowing the back doors of, like, what, this group has their own special flavor that doesn't follow those? But, yeah, that's really it. Again, it's governing the data layer and governing access to it.
[00:39:52] Tobias Macey:
And the other interesting wrinkle that comes in with data systems is that maybe you have all of the best controls for all of your well known data ingest and egress paths, but it's a constantly moving target. There's always new data systems being integrated or new use cases being applied. And to some approximation, you can manage that through having those choke points where these are the supported interfaces. But in particular, as you move from batch into maybe a streaming ecosystem, that brings in a whole new set of technologies and access patterns. And I'm wondering how you're seeing that muddy the waters of how people are thinking about the security controls over those data flows.
[00:40:35] Matt Topper:
Yeah. I'm just gonna blame the network guys. Right. And sadly, it's partially true. Right? For the last thirty years, we've been designing our systems where, hey, as long as you're inside of our network boundary, you're cool. Like, we're gonna put these big walls up to get you in and out of that. But, right, as we've moved out to multiple cloud providers, multiple SaaS providers, those walls have come crumbling down. But those groups got all the investment and the dollars over the last thirty years, so we're having to play catch up. And I think, at the end of the day, we need to figure out a way to standardize how identity travels with the data. And, traditionally, right, it is that, okay, it's gonna be a set point in time with a set place to control how this access happens or how the results are gonna be returned. And we need to figure out how to move access control to be part of the table definitions, part of the file definitions, and not just the network perimeter. Right? If we can get identity to become metadata, that's both, right, the user and the services the user's connecting through, as well as metadata on the actual data objects, that whole authorization layer just becomes automatic. Right? It becomes a union join of all of the things of the past that then makes it super simple to write the policies against. But when the data is flowing without that metadata of what these objects are, that's how things continue to get out of control. And that's part of what OpenTDF is trying to solve: every single data object that flows in or out does have metadata associated with it that the authorization decisions then have to be made against.
So, right, I think as we're designing these systems today, it's putting the metadata on your catalogs that then can allow them to be policy driven. Right? Then you can integrate things like OPA and Rego into your dbt models. You can then embed your attribute based access control for data sharing contracts. So, right, once you create that export, okay, well, I'm gonna give you this export and, right, it's a giant CSV file. Well, who's gonna be allowed to unlock that CSV file? And I'm just gonna put it on an s three bucket somewhere and tell you to go pick it up. Well, once they download it, where is it gonna go from there? And, right, putting that in an encrypted format with the metadata around it that says, yeah, you can load it, but you can only load it into, even if we were only able to say, this organization's domain, would be a huge step forward from where we are today. Or ideally, right, this organization's domain in this service, but when it's being loaded, here's the metadata that needs to be applied to protect it. So that in the future, if you've essentially invalidated our contract together, I can revoke the access key that lets you unlock that. And that is what TDF is giving us. It's a big mental shift for most data teams to get there.
But after seeing the size of the data and the amount of the data and the number of different ways it gets moved around the globe, right, at the end of the day, TDF grew out of the NSA. Right? So you can imagine, mister Snowden let us all know how much data they've had and continue to maintain, but it does work. It's just a big mental shift that has to happen.
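As a small illustration of a data-sharing contract traveling with an export, here is a sketch of a manifest whose field names are invented for the example; a real deployment would sign the manifest and hold the referenced key in a key access service so the grant can be revoked after delivery.

```python
import datetime as dt

# Illustrative manifest that accompanies an encrypted export. Revoking the key
# referenced by key_id at the key service revokes access to the export itself.
MANIFEST = {
    "dataset": "customer_extract_2024_q4.csv.enc",
    "recipient_domain": "partner.example.org",        # only workloads in this domain may unwrap it
    "required_attributes": ["contract:acme-dsa-17"],   # ABAC attributes bound to the object
    "key_id": "kas://keys/acme-dsa-17",                # placeholder key reference
    "expires": dt.date(2025, 6, 30).isoformat(),
}

def may_unwrap(caller_domain: str, caller_attributes: set[str], today: dt.date) -> bool:
    """The check a key service would run before releasing the decryption key."""
    return (
        caller_domain == MANIFEST["recipient_domain"]
        and set(MANIFEST["required_attributes"]) <= caller_attributes
        and today.isoformat() <= MANIFEST["expires"]
    )

# may_unwrap("partner.example.org", {"contract:acme-dsa-17"}, dt.date(2025, 1, 15)) -> True
```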
[00:44:19] Tobias Macey:
Another interesting approach that I've seen recently with the DuckLake table format is they've introduced the capability of being able to handle some of that row level access control automatically because you can set the key that you want to use as your filter effectively as the partition key, and they will automatically handle partitioning the data based on that value in that field and automatically apply separate encryption keys to that data so that the only way you actually can read any of it is if you've been granted access to read that partition and retrieve the decryption key to be able to read the data, where maybe you can get the files out of s three, but it's gonna be gobbledygook because you don't have the decryption keys necessary.
[00:45:07] Matt Topper:
Yeah. That's a smart approach as long as you are very good at managing those keys. Right? And that's where a lot of organizations fall down. Similarly, there was an offshoot of Hadoop years ago termed Accumulo. And when they first started building Accumulo, they would embed the policies in the row. Basically, it's cell level security, and they would embed those policies at the cell level. Well, you can imagine, as you're trying to query across all of those tables, it was a nightmare from a performance perspective when you start locking things down. So we started moving to tagging at the cell level, basically a UUID, and then the policies were abstracted outside of it. That's the same as what you're saying. I haven't looked at the DuckLake side yet; that's gonna be a late night project for me tonight to see how they're doing it. But, right, evaluate those policies on the outside and then return the unique labels of the UUIDs.
And now that makes it a very easy join to determine what can and can't be released out. But adding that extra encryption layer that you said they're doing, yeah, I'm gonna dig into it, because I've seen the issues and problems, and, hopefully, they've learned from the past, or maybe I'm gonna put in a PR later tonight to give them some heads up of what's coming.
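This is not DuckLake's actual implementation, just a minimal Python illustration of the idea: rows are grouped by a partition value and each partition is encrypted under its own key, so reading a partition requires being granted that specific key.

```python
from collections import defaultdict
from cryptography.fernet import Fernet

def write_partitions(rows: list[dict], partition_col: str):
    """Group rows by the partition value and encrypt each partition under its own key."""
    partitions, keys, encrypted = defaultdict(list), {}, {}
    for row in rows:
        partitions[row[partition_col]].append(row)
    for value, part_rows in partitions.items():
        keys[value] = Fernet.generate_key()                  # one key per partition value
        payload = "\n".join(str(r) for r in part_rows).encode()
        encrypted[value] = Fernet(keys[value]).encrypt(payload)
    return encrypted, keys  # keys would go to a key manager with per-tenant grants

def read_partition(encrypted: dict, granted_keys: dict, value):
    """Only callers holding the key for this partition value can decrypt it."""
    return Fernet(granted_keys[value]).decrypt(encrypted[value])

# files, keys = write_partitions([{"tenant": "a", "amt": 10}, {"tenant": "b", "amt": 7}], "tenant")
# read_partition(files, {"a": keys["a"]}, "a")   # works: caller was granted tenant a's key
# read_partition(files, {"a": keys["a"]}, "b")   # KeyError: no key, the bytes stay gobbledygook
```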
[00:46:28] Tobias Macey:
Yeah. So they've done a very good job from what I can tell. But for teams who are trying to tackle these challenges of identity and access management, we've just enumerated all of the ways that they're doing things wrong, and it seems like a huge lift. So I'm just wondering, what are some of the easy lifts or low hanging fruit that teams should be thinking about as they continue to battle this constantly changing ecosystem of how do I make sure that the data I'm responsible for is being accessed appropriately? And maybe that's just a matter of making sure that you have a centralized location for audit logging, but I'm just wondering what are some of the concrete steps that you generally advise teams make sure that they have in place even before they get into some of these more esoteric and complex solutions like OpenTDF?
[00:47:18] Matt Topper:
Yeah. So I think the biggest, actionable thing I would suggest for organizations to move forward is to take a look at every single service account that you've authorized to access your data and do an audit of what they can and can't access based on those service account roles and policies that you've put in place today, and really understand why they're querying your data and for what information. And I've seen way too many times where it's just like they need access to all of the data for all of the things, and they've got read, write on pretty much every database table or every file within the s three bucket.
That's just begging for data exfiltration. And take a look, with those service accounts, do you have policies in place such that you know what normal looks like for those queries from those accounts? Right? Do those queries 99% of the time come in as a lookup for an individual organization, an individual person, and then they return maybe, call it, a thousand rows or a thousand items? Would you ever be able to detect if they returned a million instead? Right? And a lot of times that gets thrown over with hopes and prayers to a SOC team that, right, we all know, and it's nothing against the SOC teams, just has too much data being thrown at them that they don't understand the context of. So, right, they're only seeing five alarm fires, and if every row in your database is being returned, it's too late. Right? The fire's already burning when they're pulling the trucks up because the stuff's out the door. So that's where I think people really could get started. And once they start understanding, here are the service accounts, here are the rotations on the service accounts, so what type of credentials are we using?
How often are they being rotated? Then, if they're being rotated, say, even on a thirty day cycle, what do the normal patterns look like just from a broad perspective? What are the normal requests and responses? Can we tell if something falls outside of that? Right. Now you're at a point where you can start getting into some of these more fine grained controls, but I will say most organizations we work with don't even have those coarse level controls, or the ability to monitor and audit those coarse level controls happening.
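A minimal sketch of the kind of baseline check described here: per service account, compare a query's returned row count against recent history and flag large deviations. The account names and thresholds are illustrative; real detection would also consider the tables touched, time of day, and source workload.

```python
from statistics import mean, stdev

# Recent row counts returned per service account (illustrative seed data).
HISTORY: dict[str, list[int]] = {"svc-reporting": [950, 1010, 870, 1120, 990]}

def is_anomalous(account: str, rows_returned: int, sigmas: float = 5.0) -> bool:
    """Flag a result set wildly larger than this account's normal responses."""
    history = HISTORY.get(account, [])
    if len(history) < 5:
        return False  # not enough history to judge; keep logging and build the baseline
    mu, sd = mean(history), stdev(history)
    return rows_returned > mu + sigmas * max(sd, 1.0)

def record(account: str, rows_returned: int) -> None:
    """Fold each observed query back into the rolling baseline."""
    HISTORY.setdefault(account, []).append(rows_returned)

# is_anomalous("svc-reporting", 1_000_000) -> True: a full-table dump stands out against
# an account that normally returns on the order of a thousand rows per query.
```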
[00:49:47] Tobias Macey:
I think one of the other challenges when you start talking about credential rotation is that the systems that are using those credentials maybe don't have any concept of rotating credentials where you're expected to just put the username and password into a text box somewhere and then save it. And maybe it encrypts the credentials in its storage layer, but there's no way for you to be able to say, hey. Whenever you need a credential, call out to this other thing. Instead, you have to do the integration work of pushing the credentials into that system if it even has a way of doing it in an automated fashion, and maybe you just have to do it manually. And that's definitely a challenge that I've encountered in some of the work that I do where for some of my systems, I could say, hey, I use HashiCorp Vault for dynamic credentials, so I'm just gonna use the Vault secrets operator in my Kubernetes cluster to fetch the credentials and keep them up to date. But maybe the system that I'm integrating with or using to do some of the data ingest doesn't have any concept of being able to pull from an environment variable. It has to go into a text field somewhere. And so now I have to build that API bridge to push those credentials into the connection configuration.
[00:50:56] Matt Topper:
Yeah. I've seen a lot of that as well. So the first step I see a lot of is, right, they don't even understand how things work with the products they're using a lot of times, and they just know that if they put the username and password that the data team provides in this field, it's gonna work. And they don't know what pieces and parts they can rotate and how. And, right, again, you gotta have some backbone from the leadership teams of saying, no, you're gonna spend the time on this, and you're gonna do it the right way. And, yeah, it's gonna take two weeks to write a script, test, and validate that script to rotate those credentials, but that's good for the team anyways.
And, right, sometimes there are tools that don't do that, and what I've found successful there is, okay, you can embed that username and password, but we're gonna take it out of what's, right, a traditional layer seven type call and force it to a layer three or layer four call with a certificate. So that every time, right, I'm old, a JDBC pool opens, it's gonna pull a certificate from the local store that is automatically being rotated outside of what the normal developers are seeing. And then every thread on that connection is using that to authenticate at layer three, layer four to your database endpoint on the other side. And then that credential can at least be rotated on a regular basis, which takes it out of the developer's hands. And, like, those things are, right, literally a built in function of something like a HashiCorp Vault or an OpenBao if you're on the fork.
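A hedged sketch of pulling dynamic database credentials from HashiCorp Vault (OpenBao exposes the same API) instead of embedding a static username and password; the mount path, role name, and server address are assumptions for the example.

```python
import hvac  # HashiCorp Vault / OpenBao client library

# Authentication (token, Kubernetes auth, certificates, ...) is assumed to be
# configured for this client already.
client = hvac.Client(url="https://vault.internal:8200")

def fetch_db_credentials(role: str = "analytics-readonly") -> dict:
    """Each call returns freshly minted credentials with their own lease, so
    rotation happens by simply letting the lease expire and fetching again."""
    secret = client.read(f"database/creds/{role}")   # assumed database secrets engine mount
    return {
        "username": secret["data"]["username"],
        "password": secret["data"]["password"],
        "lease_seconds": secret["lease_duration"],
    }

# creds = fetch_db_credentials()
# A connection pool would refresh its credentials before creds["lease_seconds"] elapses,
# which is what tooling like the Vault Secrets Operator automates for Kubernetes workloads.
```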
[00:52:39] Tobias Macey:
So as you have been working in this ecosystem, as you engage with teams who are managing the data platform and data infrastructure, what are some of the most interesting or innovative or unexpected ways that you've seen them address some of these challenges of identity, credential, and access management?
[00:52:57] Matt Topper:
I will say the Kubernetes push and some of the things that came out of Google, with the rotation of credentials and the chaining of identities as they move through the system. Right? As you can imagine, Zanzibar was built for the problem of, I wrote a Google Doc, how and with whom am I gonna share it? And those of us that use Google Docs know that sharing window: you've got people in your organization, people outside your organization, anyone with a Gmail address globally, anyone with any address globally. You've tried to put that into groups. You then map that to roles inside. Right? It's super complex. And, right, Zanzibar is great, as I said, but it takes some engineering. The true data nerd in me loves reading the Zanzibar paper, how it was implemented, and how it decouples but also tightly enforces.
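For readers unfamiliar with the model, here is a toy illustration of the Zanzibar relation-tuple idea: access is a set of (object, relation, subject) tuples, and a check walks indirect subjects such as group memberships. This is a teaching sketch under those assumptions, not Google's implementation or any particular product's API.

```python
# Toy Zanzibar-style check: can this user exercise this relation on this object?
TUPLES = {
    ("doc:design-review", "viewer", "group:data-eng#member"),
    ("group:data-eng", "member", "user:alice"),
    ("doc:design-review", "owner", "user:bob"),
}

def check(obj, relation, user, depth=5):
    if depth == 0:
        return False
    for o, r, subject in TUPLES:
        if o != obj or r != relation:
            continue
        if subject == user:
            return True
        # Indirect subject like "group:data-eng#member": recurse into that set.
        if "#" in subject:
            sub_obj, sub_rel = subject.split("#", 1)
            if check(sub_obj, sub_rel, user, depth - 1):
                return True
    return False

print(check("doc:design-review", "viewer", "user:alice"))  # True, via group membership
print(check("doc:design-review", "viewer", "user:carol"))  # False
```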
So from an interesting perspective, that whole chain of certificates and applying the policies is awesome. But from a simpler perspective, a lot of times it's just simple database proxies. If your database doesn't have the ability to do filtering inside of it, or to really differentiate credentials coming at it from different sources in an easy way, a lot of the database proxies out there will do that layer for you. They give you a way to abstract away which username and password is connecting, or which certificate is being used, and then modify the where clauses.
Say, oh, for these tables, you're gonna stripe it by customer ID. You'll find a lot of those patterns in the SaaS software world, where organizations have giant databases serving thousands and thousands of customers and use those proxies to limit them. It's a very easy way to get things like row-level and column-level security out of traditional tools that sometimes won't support it.
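Here is a highly simplified sketch of that proxy pattern: look up which tenant the connecting credential belongs to and add a striping predicate before the query reaches the database. Real proxies do this with a proper SQL parser and per-table policies; the naive string rewrite, table names, and tenant mapping below are only to show the idea.

```python
# Sketch: rewrite incoming SQL to stripe striped tables by the tenant that
# the connecting identity is mapped to. Illustrative only; a real proxy
# would parse the SQL rather than appending text.
TENANT_BY_IDENTITY = {
    "svc-acme-reporting": "acme",
    "svc-globex-reporting": "globex",
}

STRIPED_TABLES = {"orders", "invoices"}

def rewrite(identity, sql):
    tenant = TENANT_BY_IDENTITY[identity]
    lowered = sql.lower()
    if not any(table in lowered for table in STRIPED_TABLES):
        return sql
    predicate = f"customer_id = '{tenant}'"
    if " where " in lowered:
        return sql + f" AND {predicate}"
    return sql + f" WHERE {predicate}"

print(rewrite("svc-acme-reporting", "SELECT * FROM orders"))
# SELECT * FROM orders WHERE customer_id = 'acme'
```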
[00:55:05] Tobias Macey:
And in your own experience of working in this space, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:55:16] Matt Topper:
The creative ways that developers will try to get around your controls and join your data. Right? They'll pull as much data as they can into memory and join it from places you didn't expect it to be joined. And, really, there's no solving it unless you're looking at the queries that are coming in and saying, there's no way you're ever gonna return more than a hundred rows on that query if you're doing this right. And we've all seen it. Right? Devs will dev, and devs will figure out a way to get the requirement fulfilled despite our best intentions.
So, really, it's just trying to look at that audit data and having the controls in place to protect against it. That's it, sadly.
[00:56:00] Tobias Macey:
As you continue to work in this space and work with teams, particularly in the context of data systems, what are some of the developments or new technologies or protocols that you are keeping a close eye on, or any of the hopes and predictions that you have for the future of access management in the context of data?
[00:56:25] Matt Topper:
So as much as AI promises to solve everything in the world and put us all out of jobs, right, I think Larry Ellison told us twenty five years ago that we didn't need DBAs anymore, and we see how well that's gone. I do think there is a lot of opportunity with AI to solve a lot of the policy problems. Right? We didn't even address the challenge that these policies are normally written in documents, literally PDF files from some regulatory group, that then have to be translated through your own corporate legal teams, and then somehow hopefully get translated into actual policies that a developer can implement. I think we'll see better ways of doing that with AI more quickly, and as those policies change, being able to say, here's an OPA policy set for HIPAA, here's an OPA policy set for NIST 800-53 compliance, or, right, we'll just say zero trust because we haven't done that yet. But, right, there is a zero trust data tagging format, which is actually a thing built off of OpenTDF that NATO uses, where they define, okay, here are all the data tags for the policy.
Here are all the data tags that translate to policies. So I think watching that space is going to be interesting. There was a project in the identity space that was trying to create a single policy administration point, I believe it was called Hexa, where there's one administration point, but it would put out the policies in XACML and Rego and Cedar so that they could consistently be applied across systems and services. So the rise of data protection catalogs that can then be applied to datasets is something I'm hoping finally catches on and actually becomes a thing. We've talked about it in a lot of different ways over the last twenty, thirty years, but no one wants to write those policies. It's tedious, and I think if we can build some tools around it and use some of the AI capabilities to do that, we should be more successful.
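As a reader aside, the externalized-policy pattern mentioned here can be as simple as calling OPA's REST data API from whatever system needs a decision, so the same rule set is evaluated by pipelines, services, and proxies alike. In this sketch the policy package path and input fields are made-up examples, and it assumes an OPA instance is already running with such a policy loaded.

```python
# Sketch: ask a running OPA server for an allow/deny decision over its
# standard data API. The policy path and input shape are illustrative.
import requests

OPA_URL = "http://localhost:8181/v1/data/datasets/phi/allow"

def is_allowed(user, purpose, dataset_labels):
    payload = {
        "input": {
            "user": user,                 # e.g. {"id": "alice", "roles": ["analyst"]}
            "purpose": purpose,           # e.g. "treatment" vs "marketing"
            "labels": dataset_labels,     # e.g. ["phi", "restricted"]
        }
    }
    resp = requests.post(OPA_URL, json=payload, timeout=5)
    resp.raise_for_status()
    return resp.json().get("result", False)

if __name__ == "__main__":
    print(is_allowed({"id": "alice", "roles": ["analyst"]}, "treatment", ["phi"]))
```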
[00:58:38] Tobias Macey:
Are there any other aspects of this overall space of identity and access management and credential management in the context of data systems that we didn't discuss yet that you'd like to cover before we close out the show?
[00:58:51] Matt Topper:
The only thing I'd like to say is please reach out to your identity and access management teams. We keep making our own data lakes with stale data, pulling from sources that you all have much better access to and have already cleaned up in your data lakes and data warehouses, which should be centrally managed. We need things like HR records. We need things like contract information, so that when we onboard a new user from an outside organization who's gonna come work for you on a temporary basis, we know when that contract is supposed to expire and can turn off their access on the last day, rather than hoping and praying that the person running the team remembers to go into the system and turn it off. Same thing with training data. All of that information, I see most identity teams replicating on their own.
And a lot of times they'll fall back and say, well, it's in LDAP, and you guys don't speak LDAP. That's cool. Right? Please reach out to the identity teams. Please help them see the power of data, and of clean data, and enable them. This is always my hook when I have to get the data teams to go talk to our customers' identity teams: hey, you know what the identity teams hate? They hate getting blamed when someone can't log in to anything, but it's because of the bad data they got from the source systems. So if you can tell them, hey, we've got this data catalog and this data warehouse that cleans all of this up, and here's the registry, then when the position code is wrong and they make a bad authorization decision and get blamed for it, they can put up on their site, hey, here's what we know about you, and if any of this data looks wrong to you, here's the help desk or the email address to go get it fixed. Don't blame the identity team. They'll love you to death. Right? As silly and stupid as that is, they're building those systems in their own silos, and it's just slowing down innovation for all of us. And, hopefully, at the same time we'll be able to give some things back to the data teams: hey, we've got these policy engines, we've got these roles defined.
Can you reuse them at the data layer? Or, hey, we're happy to work with the application teams every day to get their roles and groups managed and integrated with single sign-on. Is there someone on your side we can turn them over to, to make sure they understand how to access the data securely? Just try to build those relationships, because we're all trying to solve the rogue-developer-getting-things-out-the-door problem from different ends of the application. If we work better together, I think we're gonna see much better results for the companies we work for.
[01:01:45] Tobias Macey:
Yeah. I absolutely agree on all of those points, particularly as somebody who is responsible for both the infrastructure and the data layer. So I'm at least in the privileged position of being the person making the decisions on both sides, so we can have a little bit tighter coordination. Well, thank you for having me today. Of course. And for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:02:12] Matt Topper:
What I'm seeing a lot of is that the gap isn't in tools. It's not in storage. It's not in speed. It's in trust composition. Right? And we can say zero trust is overloaded as shit, and I already alluded to my thoughts on that term. But we're still missing this layer. Right? We've got catalogs for data. We've got pipelines for transformation. We've got dashboards for insight. But there's still this missing layer of trust: where did that data come from, how is it allowed to be given back to people, and then from the other end, who are the users, have we attested them, have we attested those services, and how do we know the chain they took to get to the data layer?
And that to me is the biggest gap. If we can really start identifying what that trust composition is from both sides, and that obviously comes with wrapping the metadata around those things, I think some of these policy problems and decisions get easier. So that's my thought. It's a big problem to solve, but the good news is data folks tend to like big problems and take those challenges on. And please reach out to us; I'm more than happy to provide some links to where the identity community tends to hide online. Please reach in. We're trying to solve the same problems, and we need to come together.
[01:03:41] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share all of your thoughts and experience on this broad and constantly shifting problem of managing who can do what with which data, when, and why. It's definitely a very complex problem to solve, and we've got some point solutions, but it's far from being a holistically managed ecosystem. So I appreciate all of the work that you're doing to help mitigate some of those complexities, and I hope you enjoy the rest of your day.
[01:04:10] Matt Topper:
Thanks. You as well. Take care.
[01:04:20] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to guest and topic: identity in data systems
Matt Topper’s background and early data security lessons
Why data platforms complicate identity and access control
Chaining identity with JWTs and policy enforcement patterns
OPA/Rego proxies and centralizing fine‑grained policies
Identity basics: IDPs, tokens, and end‑to‑end auth flows
Beyond databases: files, buckets, and masking challenges
Policy languages: XACML, Rego, and Cedar in practice
Vendor lock‑in, procurement leverage, and abstraction layers
Finding PII and propagating labels via catalogs and lineage
OpenTDF and cryptographic ABAC for data objects
OpenFGA, Zanzibar trade‑offs, and retagging costs
Organizational buy‑in: governance, brown bags, and guardrails
Machine‑to‑machine access: short‑lived creds and attestation
Selecting enforcement choke points across interfaces
Streaming and the need to carry identity as metadata
Partition‑key encryption, key management, and cell tags
Practical first steps: auditing service accounts and logs
Working around legacy tools: certs, proxies, and rotation
Lessons learned and innovative patterns from the field
What’s next: AI for policy, unified administration, and catalogs
Collaboration between identity and data teams
Biggest gap today: trust composition layer
Closing thoughts and episode wrap‑up