In this episode of the Data Engineering Podcast Matt Topper, president of UberEther, talks about the complex challenge of identity, credentials, and access control in modern data platforms. With the shift to composable ecosystems, integration burdens have exploded, fracturing governance and auditability across warehouses, lakes, files, vector stores, and streaming systems. Matt shares practical solutions, including propagating user identity via JWTs, externalizing policy with engines like OPA/Rego and Cedar, and using database proxies for native row/column security. He also explores catalog-driven governance, lineage-based label propagation, and OpenTDF for binding policies to data objects. The conversation covers machine-to-machine access, short-lived credentials, workload identity, and constraining access by interface choke points, as well as lessons from Zanzibar-style policy models and the human side of enforcement. Matt emphasizes the need for trust composition - unifying provenance, policy, and identity context - to answer questions about data access, usage, and intent across the entire data path.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
- Your host is Tobias Macey and today I'm interviewing Matt Topper about the challenges of managing identity and access controls in the context of data systems
- Introduction
- How did you get involved in the area of data management?
- The data ecosystem is a uniquely challenging space for creating and enforcing technical controls for identity and access control. What are the key considerations for designing a strategy for addressing those challenges?
- For data access, the off-the-shelf options are typically at either extreme of too coarse or too granular in their capabilities. What do you see as the major factors that contribute to that situation?
- Data governance policies are often used as the primary means of identifying what data can be accessed by whom, but translating that into enforceable constraints is often left as a secondary exercise. How can we as an industry make that a more manageable and sustainable practice?
- How can the audit trails that are generated by data systems be used to inform the technical controls for identity and access?
- How can the foundational technologies of our data platforms be improved to make identity and authz a more composable primitive?
- How does the introduction of streaming/real-time data ingest and delivery complicate the challenges of security controls?
- What are the most interesting, innovative, or unexpected ways that you have seen data teams address ICAM?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on ICAM?
- What are the aspects of ICAM in data systems that you are paying close attention to?
- What are your predictions for the industry adoption or enforcement of those controls?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- UberEther
- JWT == JSON Web Token
- OPA == Open Policy Agent
- Rego
- PingIdentity
- Okta
- Microsoft Entra
- SAML == Security Assertion Markup Language
- OAuth
- OIDC == OpenID Connect
- IDP == Identity Provider
- Kubernetes
- Istio
- Amazon CEDAR policy language
- AWS IAM
- PII == Personally Identifiable Information
- CISO == Chief Information Security Officer
- OpenTDF
- OpenFGA
- Google Zanzibar
- Risk Management Framework
- Model Context Protocol
- Google Data Project
- TPM == Trusted Platform Module
- PKI == Public Key Infrastructure
- Passkeys
- DuckLake
- Accumulo
- JDBC
- OpenBao
- Hashicorp Vault
- LDAP
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data teams everywhere face the same problem. They're forcing ML models, streaming data, and real time processing through orchestration tools built for simple ETL. The result, inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed, flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high memory machines or distributed compute.
Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI engineering, streaming, Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workloads, see what it can do for you at dataengineeringpodcast.com/prefect. Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy? DataFold's migration agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure perfect parity between your old and new systems.
Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multisystem migrations, they deliver production ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they turn months long migration nightmares into week long success stories. Your host is Tobias Macey, and today I'm interviewing Matt Topper about the challenges of managing identity and access controls in the context of data systems. So, Matt, can you start by introducing yourself?
[00:02:05] Matt Topper:
Yeah. Sure. So Matt Topper, president of UberEther. Been trying to solve a little bit of this world of identity and data coming together for, sadly, as we were joking, about thirty years. And, obviously, I'm not very good at my job because I think it's a mess for everyone still. So
[00:02:24] Tobias Macey:
And do you remember how you first got started working in the data space?
[00:02:28] Matt Topper:
Yeah. So I was fortunate enough right out of high school. My dad had called one of his old mentees that he used to work with and said, hey. My kid needs something to do over the summer before he starts college. You got it. And, luckily enough, I got thrown into a company that had all of the vehicle, you know, registrations for the entire world. And thirty years ago, those were some of the biggest databases before kind of the Internet started gobbling every move we make and ran into a whole bunch of great data problems there of sales records being aggregated across every major manufacturer. Right after a car is sold, the manufacturer has no idea who uses it, so we aggregated all the data across essentially every state, every country, and then kind of sold that back. That turned into the problem of everywhere I went, people were like, don't do the security thing. That's where you gotta get fired if you make a mistake. That's awesome. I wanna do that for the rest of my life. And, since then, I've been trying to get all of this working together.
[00:03:34] Tobias Macey:
As you mentioned, despite having been working in this space for decades and it not being a solved problem and in some cases, maybe even becoming worse, I think a big part of that too is that at least in the context of data systems, for some brief period of the history of working with data, the predominant means of dealing with this challenge was to buy some sort of vertically integrated stack from an enterprise vendor. And, of course, they had handled all of the identity and access control challenges to some approximation. But now we're in the much broader and more composable ecosystem of bring your own solution for x and then stick it all together with whatever sort of superglue you can find.
And, of course, every system wants to own their own piece of that identity and access challenge, and none of them do it in the same way. So now we're stuck with an even bigger integration problem. And I'm just wondering if you can just talk to some of the ways that the data ecosystem is uniquely challenging for this problem of access control and identity management and auditability beyond just the standard, in air quotes, case of managing identity for a single application.
[00:04:52] Matt Topper:
Yeah. I was lucky enough, and I'm sure a lot of people on the podcast, when I say the word Oracle, either cringe or cheer with joy. I don't think there's a middle in that, but I was part of the Oracle team that worked on some of their original label security pieces. That got into some interesting spaces when we worked in the federal government, because we would be sharing data across unclassified, secret, and top secret networks, right? The same user on those different networks would be able to see different things based on their classification level, and then within that, federating out to other countries accessing that data across the world.
And we ended up having to extend the JDBC connectors at that time to say, okay, give me the Java thread in the container where the user identity was being persisted, and pass that as part of the label security. We would literally pick the thread out of the pool, assert the user identity on that thread, and then that would allow us to filter in label security what they can and can't see. But, obviously, that was super specific to Oracle. And luckily enough, we were on the inside, so we could modify drivers, we could modify the back end, and that doesn't really exist across the ecosystem today. Some of the things I'm seeing starting is that some of the databases have started to accept JWT tokens.
So you're able to actually start passing from the user's browser, if it's in that case, or the service account, to the API, to the next API, and chain the identities through to say, I know the paths that this data is coming in through. And the more you can look towards those, the better off you're going to be. Right? Because that gives you, at every hop along the way, who the user is, what the service accounts they're going through are, maybe even some of the context of where in the world they are, and then at least use the JWT to call out, get the person's roles, and then apply that to any policies you have in the database. Obviously, your data integration layer may have its own way of doing that, which gets into, as you said, right, they all have their own policy rules and their own special languages for that.
Where I've seen some successes, there's a project called OPA, Open Policy Agent. Rego is the policy language, and OPA is kind of the enforcement point. And I've seen, not built into a lot of the databases or data lakes yet, but an ability for basically a proxy in front of that to take those calls and say, hey, based on this user and this path, what are they allowed to query or not? And it'll return a list of either roles or different tags you've got on the data that you can then apply to your where clauses dynamically. And that's been quite successful in some of the orgs we've worked with.
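As a rough illustration of the proxy pattern Matt describes, here is a minimal Python sketch that asks a locally running OPA instance for row-level predicates and appends them to a query's WHERE clause. The policy package path and the shape of OPA's result (a list of SQL predicates) are assumptions for the example, not a standard contract.

```python
import requests

OPA_URL = "http://localhost:8181/v1/data/datalake/authz/row_filters"  # assumed policy path

def fetch_row_filters(user: str, table: str, token: str) -> list[str]:
    """Ask OPA which row-level predicates apply for this user and table.

    The result shape (a list of SQL predicate strings) is an illustrative
    convention for this sketch, defined by the Rego policy you write.
    """
    resp = requests.post(OPA_URL, json={"input": {"user": user, "table": table, "jwt": token}})
    resp.raise_for_status()
    return resp.json().get("result", [])

def apply_filters(base_query: str, filters: list[str]) -> str:
    """Append OPA-derived predicates to the query before forwarding it."""
    if not filters:
        return base_query
    predicate = " AND ".join(f"({f})" for f in filters)
    joiner = " AND " if " where " in base_query.lower() else " WHERE "
    return base_query + joiner + predicate

# Example of what a proxy might do for every query it forwards to the warehouse:
# query = apply_filters("SELECT * FROM orders", fetch_row_filters("matt", "orders", jwt_token))
```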
[00:07:48] Tobias Macey:
So you've mentioned a few different technologies and protocols that have been emerging largely as an outgrowth of the investment in cloud native architectures and the Kubernetes ecosystem and the growth of microservices for better or worse.
[00:08:05] Matt Topper:
And that's an interesting one. Can I be that opinionated?
[00:08:11] Tobias Macey:
And so that's a set of techniques and protocols that have grown up in one ecosystem, and it's starting to filter into the space of data platforms, I think largely because the data engineering role has grown to overlap with the platform engineering role as we gain some levels of sophistication in how to deal with these increasingly complex systems. And before we get too much further down that road, I think it's probably worth moving up to a little bit of a higher level first and just enumerating what we mean when we talk about these questions of identity and credentials and access management and some of the different ways that they manifest, particularly from the end user perspective of, I just wanna be able to access the data. But then from the organizational perspective, we need to be able to say, okay. You can see these attributes of the data, but not these other ones because of regulatory or data privacy considerations.
[00:09:09] Matt Topper:
Yeah. So taking that step back, right, normally, inside of a browser, you're gonna have some sort of authentication event, which normally will go to an identity provider. So most organizations have that single sign on, single login page. A lot of times, that's gonna be Ping Identity, Okta, Microsoft Entra ID, or your Google Workspace. And, right at that point, here's the authentication event. Here's how strong that authenticator was. Because sometimes you might say, hey. If you're using username and password, you're not getting the golden goose. Right? You're not getting the good data that we have because that's so easily compromised from a username and password perspective. So once you log in there, most of the time, you're gonna get either a SAML assertion for the app or you're gonna get a JSON web token through OAuth or OpenID Connect. So that's what your front door application is gonna see, and it's gonna be either a unique identifier that says go back to the IDP and look it up and tell me who this person is, which is my preferred method because we all know the browser is the most easily compromised thing in the entire path. And, right, at that point, that token gets signed to the service, and then the service has to present their key to say, hey. Who's Matt? Or who is this? And then the service will say it's Matt. So now you have two places you need to compromise to get that stolen. So, right, the service now knows who the person is.
Well, at that point, a lot of times in these Kubernetes ecosystems, that web application then has to call an API. Right? And that web application is gonna get a short term certificate if they're using something like Istio. But at the end of the day, it's the credential for a service account. So a lot of organizations will either embed username and passwords, aka OAuth tokens and secrets, to call those APIs. And then with that, call back to the IDP as a security token service and say, hey. I'm now making a call to this API on behalf of this user. So now we've got a chain of those. And finally, after that, that chain continues down the path to get to the database. And, right, a lot of times, the database doesn't even get to see that, and that's where I'm, as I think you said, seeing more of the JWT tokens being able to be accepted at the data layer. So you can now say, okay. I know this has come through a trusted path. There's no way that that data's been compromised.
And now based on the user, where they're coming from, the services they're accessing, I know that I can only release this data or modify my where clauses, my queries this way or another.
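To make the identity chaining concrete, here is a hedged Python sketch of one way a mid-tier API could verify an incoming user token and trade it for a downstream-scoped token using the OAuth 2.0 token exchange grant (RFC 8693). The IdP URLs, client credentials, and audience values are placeholders, and the exact claims your identity provider expects will differ.

```python
import jwt        # PyJWT
import requests

TOKEN_ENDPOINT = "https://idp.example.com/oauth2/token"  # assumed security token service endpoint

def exchange_for_downstream_token(incoming_jwt: str, audience: str) -> str:
    """RFC 8693 token exchange: trade the caller's token for one scoped to the
    next hop (e.g. the database), so the user's identity travels with the call.
    The client id/secret here stand in for the workload's own credential."""
    resp = requests.post(
        TOKEN_ENDPOINT,
        data={
            "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
            "subject_token": incoming_jwt,
            "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
            "audience": audience,
        },
        auth=("api-service-client-id", "api-service-secret"),  # placeholder workload credential
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def identity_from_token(token: str, public_key: str) -> dict:
    """Verify signature and audience before trusting the asserted user identity."""
    return jwt.decode(token, public_key, algorithms=["RS256"], audience="analytics-db")
```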
[00:11:53] Tobias Macey:
The other interesting wrinkle in this space that has been growing in the past few years is that databases aren't even necessarily the storage and retrieval mechanism for the information and the access, particularly as we move more into this AI era where everything is either files on disk in the cloud somewhere or an s three bucket or some sort of vector store. And so now we have to be able to manage access controls at a larger number of touch points where, for a while, you could just say, as long as my database controls are sufficient, then I can feel good about people being able to access the things they need to access. Now maybe you have a SQL interface using something like Trino or some other federated query engine, but maybe you're sending people directly to those files in an s three bucket. And so then you need to have those same types of access control available in an environment where maybe it doesn't even understand anything about the column structure, but you still need to be able to manage some sort of masking or encryption to make sure that only people who are permitted to have access to address data or grade information, etcetera, can actually see it.
[00:13:06] Matt Topper:
Yeah. And that's, like, I can say on the identity community side, having lived in the identity community and the data community for so long, there's a lot of parallels in the same problems we all have. Right? Whereas we're not always embedded into these application teams that are designing the solutions, and they're just saying, I need access to this thing. And the biggest thing is there is no good global shared language for access control at this point. So, right, you could say, here's coarse-grained: you've got role level access.
But at the end of the day, in the data world, those are always too simple. And at the end of the day, that then becomes too insecure, because it's like you get access to that bucket on s three or that folder in that bucket on s three, but it doesn't say, oh, in that parquet table, here's the columns, and within those columns, you only get where values equal x, y, and z, which then gets you into the fine level row and column controls. But they're completely unmanageable at scale. Right? Every single time you've got a new bucket, a new folder, a new file, you're now having to write all these super fine grained policies. And we're starting to see more success, and I'll say it's not great right now, but it's kinda one of these things where we've gotta figure out a way to organically get there, which is the policy abstraction layer. So in the identity world we live in, there's, right, a policy enforcement point, which, right, just say an s three bucket is the front door of who can access what files inside of there, and then a policy decision point, which is essentially abstracted from that, that then says, okay, based on the user and the folder they're looking at, what are now more fine grained controls to look at? But then it's not embedded into all of the apps and embedded into all of the code. And that's where, traditionally, we've seen things like the XACML language, which is incredibly complex, incredibly hard to write to.
The admin interfaces to write policies are, in most cases, garbage. And then lately, we've seen the growing up of Rego and OPA, as well as Amazon coming out with a policy language called Cedar that has actually grown quite well and is very good and focused at the data problem. So I know a couple of the members on that team, and I believe Snowflake is adopting Cedar, if I remember correctly. But it's been a little while since I looked at it. But that'll at least give us an ability to say, hey, here's a set of rules for GDPR or for HIPAA policies that says, here's what's gotta be encrypted, here's how it's gotta be encrypted, here's how it can be released or not released. And then you can start changing those policies in one place and reapply them to the data throughout your entire ecosystem versus trying to have to craft these one off solutions.
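The enforcement point / decision point split Matt sketches can be illustrated without any particular policy language. The following toy Python decision point keeps rules in one place so an enforcement point (an S3 proxy, a query engine, an API gateway) only asks for an allow or deny; the attribute and tag names are invented for this example and are not Cedar or Rego syntax.

```python
from dataclasses import dataclass

@dataclass
class Request:
    user: str
    clearance: set[str]      # attributes the caller holds, e.g. {"pii"}
    resource_tags: set[str]  # tags on the column/file, e.g. {"pii", "restricted"}
    action: str              # "read", "export", ...

# Each policy is (description, predicate). Changing a rule here re-applies it everywhere
# the PDP is consulted, instead of being re-coded in every app.
POLICIES = [
    ("PII requires pii clearance",
     lambda r: "pii" not in r.resource_tags or "pii" in r.clearance),
    ("Restricted data may never be exported",
     lambda r: not (r.action == "export" and "restricted" in r.resource_tags)),
]

def decide(request: Request) -> bool:
    """Policy decision point: allow only if every policy predicate holds."""
    return all(pred(request) for _, pred in POLICIES)

# A bucket front door (the PEP) would call:
# decide(Request("matt", {"pii"}, {"pii"}, "read"))  -> True
# decide(Request("matt", set(), {"pii"}, "read"))    -> False
```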
[00:16:10] Tobias Macey:
The other interesting piece, as you said, is we have these abstraction layers where we have the Rego policy language. I know that there's the Oso project that has their Polar language. You mentioned Cedar. And so that is one integration point that we can use to move more of the common logic to act as that multiplexer for the problems that we're trying to solve. But then you end up in a situation where maybe you're using a tool or a platform that doesn't have an easy integration point to be able to act as that translation, or it's a vendor managed solution and you just get whatever they wanna give you. And so then you're in a situation of having to either build your own glue or just throwing up your hands and saying, I give up. And so we have this problem of we either have too much granularity and control, and so it's an exercise in frustration just trying to even figure out what are the things that I need to say and do. AWS's IAM policy language is probably the best example of something that is too fine grained and too complex, and nobody ever knows for sure if what they're doing is actually right, including anecdotally people who work at Amazon and build the IAM system. Or you have something where it's too coarse grained, and it's either just a Boolean, you have access or you don't. Or maybe it's, to your point, you either have access to this bucket or this folder or you don't, but that doesn't give you the necessary level of consideration. And I'm just wondering how you're seeing people start to try and address this inherent complexity and the highly dimensional nature of this problem space where we want to be able to have attribute based access control where we can just have these conditional checks that filter all the way down, but either creating and propagating them is impossible or too complex.
[00:18:02] Matt Topper:
Yeah. Sadly, the only way to force it that I've found across the vendors, and, right, this is an impossible task for most data and identity engineers, is you have to force it on your procurement teams when these tools are being bought and say, unless you're supporting these standards to externalize the data or externalize the policies, we're just not gonna buy you. And, right, depending on the size of your organization, you do or don't have that level of push. And a lot of times, it's gotta be a lot of big organizations making that push together and working towards it. Because having worked for a lot of the large vendors or with them over time, this is one of the wonderful ways they love to lock you in and make sure you can't move to somebody else, because you are gonna spend a ton of time writing these policies to enforce your data, and then you go to move to another tool and somebody goes, oh, that's a three year project or even a six month project. Well, we don't have time, and then it just continues to proliferate out. At the end of the day, it's either that approach, which is seemingly more successful from what I've seen, or you're writing your own abstraction layers on top of these tools that do the enforcement and then just reflect the filtered queries back to the tools themselves, which then becomes a one off continuous nightmare to manage as well. There truly is no good solution I've found, and good luck selling that to executives and, right, getting the time to build those things too.
[00:19:33] Tobias Macey:
I think another aspect of the complexity is that even if you do have a policy language where you can say you have access to PII or you do not have access to PII, and being able to enforce that, it's not even always clear, one, what constitutes PII, because there are different levels and gradations of that, but also whether or not the thing that they're being granted access to contains PII, because that's not something that is always identified at the ingest layer, or even if it is identified at the ingest layer, maybe it's not propagated downstream. And so just being able to manage the identification of which data sources and which datasets and which attributes are the things that actually need to be protected, assuming that you have the necessary controls to enforce those protections.
And I'm just wondering how you see people try to solve that, particularly now that we're moving into the space of unstructured datasets where PII is an even more difficult challenge to identify and address.
[00:20:35] Matt Topper:
Yeah. I think a lot of that where I've seen organizations have success is it's the idea of the data governance layer above the data itself and how that flows through your organization and where it goes. So whether you're building that into your data orchestration layers or, I personally like data catalog systems. Right? There's some good open source tools out there for that that says, hey. Here's all the endpoints we've got for APIs across our organization. Here's whether that's data related or, right, service APIs on the other side. And here's all of the elements that are being returned. And then within that catalog, you can at least start as data engineers filtering down and going, why can't it say SSN?
Oh, that's the super secure node and not Social Security number. Right? But, like, by having a centralized catalog that you can look to and kinda run over, then, right, ideally, that catalog is also seeing your orchestration pipeline. You can start to see basically all the relationships between those pieces and tie them together. Right? There's some really advanced ways that this has been done. So I've worked in the US intelligence community for a long time. And, right, there's always this fun problem of, hey, we're friends with this group today, but we might be mortal enemies tomorrow. Or based on this conflict, we're friends to get this mission done, but as soon as that mission's over, we don't want them to have the data anymore. And there's been some interesting ways we've solved that of wrapping the data before it can be used and providing our own APIs around that and some extremely complex key management capabilities.
But most organizations don't have the time, money, or resources to do that, nor is it something that's even near a standard today. So, yeah, there's no good answer. Have you seen anything in stuff you've done?
[00:22:43] Tobias Macey:
Not any universally applicable method. So to your point, data catalogs, I think, are probably the most holistic way to do it, presuming that you have a means of being able to do some of that identification, whether in an automated fashion or even in a manual fashion. And then as long as it has a good lineage tracking capability, then you can use that to propagate those labels to downstream data assets and then be able to figure out where do I need to manage these access controls. So the tool that I'm working on integrating with and investing in is OpenMetadata, which has some of those capabilities. The other piece that I've seen some investment in, and, unfortunately, it's now behind a paywall, is the dbt Fusion engine, which was previously, I'm blanking on the acronym that they used as their name, but they were a team that got acquired by dbt. They were a Rust-based engine that would allow you to have more descriptive typing information on the columns for being able to say, this is PII, or this is PII of this sort, this is an address, this is a Social Security number, etcetera. And then through the dbt execution graph, be able to propagate those labels as well and manage some of that granularity of access control. But beyond having some system that understands the way that the data propagates and proliferates and allowing those labels to follow those different transformations, there's not really any other good way to do it other than just being very manual and draconian in your practices, or just giving up and saying, I hope nothing bad happens, which is obviously not the right answer.
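A minimal sketch of the lineage-based label propagation described here: tags applied at ingest are pushed to every downstream column over the lineage graph, so policies can key off the tag wherever the data lands. The table and column names are made up, and a real catalog such as OpenMetadata would manage this graph for you.

```python
from collections import defaultdict, deque

# Lineage edges: upstream column -> downstream columns it feeds (illustrative names).
LINEAGE = {
    "raw.users.ssn": ["staging.users.ssn"],
    "staging.users.ssn": ["mart.customer_360.ssn", "mart.risk_features.ssn_hash"],
}

def propagate_tags(seed_tags: dict[str, set[str]]) -> dict[str, set[str]]:
    """Push tags applied at ingest (e.g. {"pii"}) to every downstream column."""
    tags = defaultdict(set, {col: set(t) for col, t in seed_tags.items()})
    queue = deque(seed_tags)
    while queue:
        col = queue.popleft()
        for downstream in LINEAGE.get(col, []):
            if not tags[col] <= tags[downstream]:   # only revisit when new tags arrive
                tags[downstream] |= tags[col]
                queue.append(downstream)
    return dict(tags)

# propagate_tags({"raw.users.ssn": {"pii"}})
# -> the staging and mart columns inherit the "pii" tag, so access policies keyed on
#    that tag follow the data through each transformation.
```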
[00:24:27] Matt Topper:
I hope our CISO has enough of an insurance policy on their job, which, sadly, is something I see in CISO contracts now: hey, I've told you that there's no way we can protect this without you guys making the investment. The board's made a decision not to make that investment. So we're gonna put it into my CISO contract: you're gonna pay me for two years if we get breached. Like, that's sadly the point we're at with data security for many organizations. I did take a quick look. So I mentioned some of the things that have been done with the US intelligence community and some of the NATO partners we've got. There is an open format called OpenTDF, which is the Trusted Data Format. And I can say NATO is pushing a lot of the large vendors to start adopting this. So it is a way to cryptographically bind attribute based access control policies to data objects as they, right, are sitting as files inside of buckets. Right? Essentially, it's wrapping those files to then say, I need a trusted identity object to verify against before I'm gonna unlock this data for someone to access. Right? It's a part of the problem set.
It doesn't always get to the finest grain of controls, but it's at least a standard that some pretty large organizations with pretty deep pockets are pushing now, and I've seen it working quite well at scale.
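The following is not the OpenTDF SDK; it is a small Python illustration of the idea of binding an attribute-based policy to a data object by encrypting the payload and carrying a policy manifest with it. In a real TDF deployment the key is escrowed with a key access service that checks the caller's attributes, rather than traveling with the object as it does in this self-contained demo.

```python
from cryptography.fernet import Fernet

def wrap_object(payload: bytes, policy: dict) -> dict:
    """Bind an access policy to a data object: encrypt the payload and attach
    the policy as manifest metadata. The key_id is a placeholder reference."""
    key = Fernet.generate_key()
    return {
        "manifest": {"policy": policy, "key_id": "kas-key-123"},
        "ciphertext": Fernet(key).encrypt(payload).decode(),
        # The raw key would live in a key access service, not in the object;
        # it is embedded here only so the sketch runs on its own.
        "_demo_key": key.decode(),
    }

def unwrap_object(wrapped: dict, caller_attributes: set[str]) -> bytes:
    """Stand-in for the key-service check: release the payload only if the
    caller's attributes satisfy the policy bound to the object."""
    required = set(wrapped["manifest"]["policy"]["require_attributes"])
    if not required <= caller_attributes:
        raise PermissionError("caller does not satisfy the data object's policy")
    return Fernet(wrapped["_demo_key"].encode()).decrypt(wrapped["ciphertext"].encode())

# wrapped = wrap_object(b"name,ssn\n...", {"require_attributes": ["nato-mission-x"]})
# unwrap_object(wrapped, {"nato-mission-x"})   # succeeds
# unwrap_object(wrapped, set())                # raises PermissionError
```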
[00:26:02] Tobias Macey:
Composable data infrastructure is great until you spend all of your time gluing it back together. Bruin is an open source framework driven from the command line that makes integration a breeze. Write Python and SQL to handle the business logic and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end to end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform.
Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you a thousand dollar credit to migrate to Bruin Cloud. Yeah. Another project that I have encountered but haven't invested much time into researching or playing with is OpenFGA, which I think stands for fine-grained authorization and is based on Google's Zanzibar system. And that's another Cloud Native Computing Foundation project that has started to make inroads into the data ecosystem as well. So we are getting to a point where there are enough broadly compatible and open-for-integration layers that we can start to build some of that unified access control plane, but there's still a lot of work left to do. And then there's also still that problem of some of the systems that you're running are so old that they're never going to work with this, and some of them are so new that they're trying to build their own system. And so just trying to wrangle that complexity is an exercise in futility.
[00:27:37] Matt Topper:
Yeah. And at least when I've used the Zanzibar based systems in the past, they are very fine grained, and you have to have your data structured pretty tightly to be able to use them; essentially, you're defining policies of policies in Zanzibar in many cases. And you really do have to know, like, you have to have very, very set patterns for it to be successful. And a lot of times, you end up having to retag a lot of your data if your patterns change. And, right, depending on how big your datasets are, that can be extremely expensive to do.
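For readers unfamiliar with the Zanzibar model, here is a toy Python sketch of relationship tuples and a check function. It only handles direct tuples and one level of group indirection; real Zanzibar and OpenFGA add relation rewrites, consistency tokens, and much more, and the object and relation names here are illustrative.

```python
# Access is derived from (object, relation, subject) tuples rather than
# per-resource ACL code scattered through applications.
TUPLES = {
    ("dataset:sales_2024", "owner", "user:matt"),
    ("dataset:sales_2024", "reader", "group:analysts#member"),
    ("group:analysts", "member", "user:tobias"),
}

def check(obj: str, relation: str, user: str, depth: int = 5) -> bool:
    """Return True if the user holds the relation directly or via a group userset."""
    if depth == 0:
        return False
    if (obj, relation, user) in TUPLES:
        return True
    for o, r, subject in TUPLES:
        # A subject like "group:analysts#member" means: anyone who is a member
        # of group:analysts also holds this relation.
        if o == obj and r == relation and subject.endswith("#member"):
            group = subject.split("#")[0]
            if check(group, "member", user, depth - 1):
                return True
    return False

# check("dataset:sales_2024", "reader", "user:tobias") -> True, via group membership
# check("dataset:sales_2024", "owner", "user:tobias")  -> False
```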
[00:28:15] Tobias Macey:
The other challenge in this space is that for that governance protocol to be effective, you need to have acceptance and buy in from the organization to actually force people to adhere to it because, otherwise, they're going to say, but I need access to the data, and my boss said it's fine. And so then the data teams are just left in another situation of throwing up their hands and hoping for the best. But I'm just wondering whether, I guess, as we move to a space where we actually have more of that easily enforceable fine grained control, where you're not preventing people from doing their work just because they can't get access to the data or to a particular table because there's too much sensitive data in it, but they can see just the things that they need, maybe that's a way of reducing that point of contention and getting more of the organization on board with these access policies and control policies. And I think the other tension that I've seen too is that you can aggregate data and anonymize it that way, but then you have people saying, oh, but I need to be able to access every single row for purpose x, y, and z. So it's about managing some of that education and facilitating the needs of the organization without just being a gatekeeper for the sake of gatekeeping.
[00:29:34] Matt Topper:
Like, on the day-to-day side of the world, we run into the exact same thing, right, where everyone's like, well, I bought this tool, and it's got its own user management interface, and then we're just gonna do it ourselves. And you're just like, jeez. Right? Like, no, you need to get in a centralized governance system. No, you need to reuse these roles, because otherwise that's how all the data gets let out. So a lot of times, it just becomes the, hey, here's the interfaces we support, here's how you get access to it, and then it becomes part of, right, however your organization does governance of rolling out applications to production, right, risk management framework type stuff, and saying, hey.
Essentially, it has to be a monumental event to overcome the data team's policy of how you access these things. And you've gotta have CIOs and CTOs and CISOs with strong backbones that say, no. Fix your stuff, devs. We will miss this deadline because you didn't do it right. And, right, that comes to board pressure, but we have at least found that if you can have, kinda, what we call brown bags of, like, hey, every Thursday, we're gonna have five of our best data engineers. If you're rolling stuff out in our organization, they're available from noon to one every Thursday, and join this call if you've got questions about integrating with our platform. And then, right, that leaves the door open. And then when they get to rolling things out to the governance board, they can ask, did you show up to brown bags? Did you work through these things? And then at least the data teams and the app teams have made the effort to have the discussions, so the boards can say, no, you didn't do the thing, and push back that way. Alright? And then, obviously, it becomes someone's job to take those brown bags and turn them into FAQs, turn them into backlogs of, yeah, there is a gap here. We actually don't know how to solve this problem for this type of app or this type of new streaming interface that's coming in. And then, right, FAQs get updated, that type of stuff. But there's, again, never a silver bullet between either of these teams on either side.
[00:31:43] Tobias Macey:
And so, generally, we've been speaking from the perspective of human operators trying to get access to data to be able to perform some role in their job. But then there's also the other side of machine access and managing appropriate controls there where, generally, if it's machine to machine, we just say, oh, well, I just trust that the machine is going to do what I told it to do, so it's fine. As we get into more agentic and AI driven workloads, obviously, that gets even more fraught, but I'm just wondering how you're seeing teams either properly address, and maybe some of the ways that you're seeing them fail to address, some of those considerations of appropriate access control and role based definitions when there isn't necessarily a human in the loop or when there's a machine that is doing some data fetching that will eventually be displayed to a human.
[00:32:37] Matt Topper:
Yeah. I've said this a lot lately, especially with the AI. We're all in on the Model Context Protocol, and, right, it's obviously hitting the data community hard right now because everyone wants an MCP server to access the data behind the scenes. And I will say the Google project that they have for data access is really well done and thought out. If people haven't played with or looked at that yet, I can probably send a... oh, that's cute. My Google phone just picked up thinking I was asking it a question. But I can send you a link to that project that you can include in the show notes; it's very, very well done going to traditional data sources. But when Anthropic first put out that spec, the security section literally listed, whack, whack, to do.
Like, it was, yeah, we know this is a hard freaking problem. We're just gonna put out the spec and figure it out later. And that quickly became, like, some of the first patterns for MCPs: here's how to go into your browser and copy the cookie and give it to the MCP server to then access all of the data. Alright. I think both the data and the identity engineers are, like, having heart attacks at how terrible that is as a pattern. In general, what we like to see, and this did come out of the Kubernetes ecosystem but has very much been proven on traditional workloads, is to use more short lived credentials.
And using, really, attestation of the servers or the services that are running to then be able to grant those credentials. So even, right, in a Kubernetes environment, that container has to be admitted into the cluster. Well, if you can prove that that container's hash essentially matches the one that comes from your container registry and hasn't been modified, and it's running on a Kubernetes host that has a TPM on it, right, even a virtual TPM, which every cloud environment has, well, then I at least know that that container runs these workflows on trusted hardware that we know about and manage, and I can give it a credential. And then that credential is normally either a JWT token or a PKI certificate that'll rotate every x amount of hours.
So then from a data layer perspective, you can now say, hey, this is signed, asserted, and attested to the workload level, not just the device level, the workload that's running. And for an attacker, if they were to come in and steal that, they'd have to remain persistent there and pick up a new token every two, four, 24, whatever that policy is, hours, and that makes the attackers way easier to spot. It also makes it easier from a data engineering perspective to say, oh, that's tied to this workload, and they're only allowed to query these tables, these columns.
And if somebody tries to go outside of that, or they're trying to use one of those tokens from a machine that hasn't been attested, then it's way easier to spot those things from an attacker perspective. And that's sad, right? Same type of thing with pressures of getting things out the door. A lot of the problems I've seen is you've got these long lived tokens, and we have, right, pushed all of our users to get rid of their usernames and passwords and move to passkeys or FIDO tokens, or having all of our credentials put into a, right, privileged access management blocker where you have to go log in with your SSO, and then you get the username and password to copy into your app. I apologize on behalf of the entire identity industry for that pattern because it sucks. But all of that, like, we've changed that for the users. But at the end of the day, we've been giving our applications just another username and password, and maybe we just call it a token. But then instead of being limited to the individual human, it's the entire database.
And it's actually way more insecure and opens the doors to attackers much, much more easily. And that's what I've seen all of the pivots from an attack perspective be lately: right, it'll come in as a human, but immediately they'll pivot to finding shared credentials on services. So by using ephemeral credentials, limiting down how long they can live is forcing bad guys to get caught, and then also allows us from the data engineering side to better limit what things can be accessed. And then if the policies have to change, right, this is part of the Zanzibar side of things from Google, but also what is publicly known as Istio now. Right? Sometimes when you mint that token, you're gonna put those policies in the token and what returns true for those four hours.
So now when that gets to the data layer, you can use what's in that policy. And if the policy for what that data can access needs to change, well, you're gonna have a window where, yeah, for the next four hours, that token says the policy is x, y, and z. Well, in four hours, we know it's gonna change to be a, b, and c. So now you're not having to build additional things into your central data layer to then get the updates from the PDP; they come through the token on the front door, which actually may be an easier pattern for many people to implement.
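A hedged sketch of the short-lived workload credential pattern: the token carries its authorization (which tables the workload may query) as claims with a bounded expiry, so the data layer can enforce it without calling back to the policy engine until the token rotates. The signing key, claim names, and TTL are illustrative; a production system would use asymmetric keys issued off a workload attestation.

```python
import time
import jwt  # PyJWT

SIGNING_KEY = "workload-identity-demo-secret"  # placeholder; real deployments use attested, asymmetric keys

def mint_workload_token(workload: str, allowed_tables: list[str], ttl_hours: int = 4) -> str:
    """Mint a short-lived credential whose policy is embedded as claims."""
    now = int(time.time())
    claims = {
        "sub": workload,
        "iat": now,
        "exp": now + ttl_hours * 3600,          # forces rotation and limits stolen-token value
        "allowed_tables": allowed_tables,        # illustrative claim name
    }
    return jwt.encode(claims, SIGNING_KEY, algorithm="HS256")

def authorize_query(token: str, table: str) -> bool:
    """Data-layer check: reject expired tokens and tables outside the embedded policy."""
    try:
        claims = jwt.decode(token, SIGNING_KEY, algorithms=["HS256"])
    except jwt.ExpiredSignatureError:
        return False
    return table in claims.get("allowed_tables", [])

# token = mint_workload_token("fraud-scoring-job", ["transactions", "merchants"])
# authorize_query(token, "transactions") -> True
# authorize_query(token, "users")        -> False
```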
[00:38:16] Tobias Macey:
I think too that one of the bigger challenges in terms of data architecture is just figuring out what is the choke point or set of choke points that I'm going to rely on for being that policy enforcement layer because you can go all the way down to the metal and say, okay. I need to do my policy enforcement at the bits and bytes layer or at the layer of the individual files and how those are laid out. Or you can say, well, I'm only going to allow access to that underlying infrastructure through these three different interfaces where maybe I have my SQL layer for my data warehouse and my parquet files that are in s three buckets, etcetera.
And maybe I also allow for some sort of notebook system to be able to access them, but the only way that the notebook is going to be able to authenticate to those is either through that SQL layer or through AWS IAM or what have you. And then maybe the only other way that I'm going to allow for this data access is through some sort of controlled export mechanism that is managed by my orchestration system. And so if you want to be able to access this data, I'm going to be the one to hand it to you. You're not going to get it yourself.
[00:39:30] Matt Topper:
Yeah. I mean, it's just controlling those interfaces and not allowing the back doors of, like, what, this group has their own special flavor that doesn't follow those? But, yeah, that's really it. Again, it's governing the data layer and governing access to it.
[00:39:52] Tobias Macey:
And the other interesting wrinkle that comes in with data systems is that maybe you have all of the best controls for all of your well known data ingest and egress paths, but it's a constantly moving target. There's always new data systems being integrated or new use cases being applied. And to some approximation, you can manage that through having those choke points where these are the supported interfaces. But in particular, as you move from batch into maybe a streaming ecosystem, that brings in a whole new set of technologies and access patterns. And I'm wondering how you're seeing that muddy the waters of how people are thinking about the security controls over those data flows.
[00:40:35] Matt Topper:
Yeah. I'm just gonna blame the network guys. Right. And sadly, it's partially true. Right? For the last thirty years, we've been designing our systems where, hey, as long as you're inside of our network boundary, you're cool. Like, we're gonna put these big walls up to get you in and out of that. But, right, as we've moved out to multiple cloud providers, multiple SaaS providers, those walls have come crumbling down. But those groups got all the investment and the dollars over the last thirty years, so we're having to play catch up. And I think, at the end of the day, we need to figure out a way to standardize how identity travels with the data. And, traditionally, right, it is that, okay, it's gonna be a set point in time with a set place to control how this access happens or how the results are gonna be returned. And we need to figure out how to move access control to be part of the table definitions, part of the file definitions, and not just the network perimeter. Right? If we can get identity to become metadata, that's both, right, the user and the services the user's connecting through, as well as metadata on the actual data objects, that whole authorization layer just becomes automatic. Right? It becomes a union join of all of the things of the past that then makes it super simple to write the policies against. But when the data is flowing without that metadata of what these objects are, that's how things continue to get out of control. And that's part of what OpenTDF is trying to solve: every single data object that flows in or out does have metadata associated with it that the authorization decisions then have to be made against.
So, right, I think as we're designing these systems today, it's putting the metadata on your catalogs that then can allow them to be policy driven. Right? Then you can integrate things like OPA and Rego into your dbt models. You can then embed your attribute based access control for data sharing contracts. So, right, once you create that export, okay, well, I'm gonna give you this export and, right, it's a giant CSV file. Well, who's gonna be allowed to unlock that CSV file? And I'm just gonna put it on an s three bucket somewhere and tell you to go pick it up. Well, once they download it, where is it gonna go from there? And, right, putting that in an encrypted format with the metadata around it that says, yeah, you can load it, but you can only load it into, even if we were only able to say, this organization's domain, would be a huge step forward from where we are today. Or ideally, right, this organization's domain in this service, but when it's being loaded, here's the metadata that needs to be applied to protect it. So that in the future, if you've essentially invalidated our contract together, I can revoke the access key that lets you unlock that. And that is what TDF is giving us. It's a big mental shift for most data teams to get there.
But after seeing the size of the data and the amount of the data and the number of different ways it gets moved around the globe, right, at the end of the day, TDF grew out of the NSA. Right? So you can imagine, mister Snowden let us all know how much data they've had and continue to maintain, but it does work. It's just a big mental shift that has to happen.
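As a small illustration of a data-sharing contract traveling with an export, here is a sketch of a manifest whose field names are invented for the example; a real deployment would sign the manifest and hold the referenced key in a key access service so the grant can be revoked after delivery.

```python
import datetime as dt

# Illustrative manifest that accompanies an encrypted export. Revoking the key
# referenced by key_id at the key service revokes access to the export itself.
MANIFEST = {
    "dataset": "customer_extract_2024_q4.csv.enc",
    "recipient_domain": "partner.example.org",        # only workloads in this domain may unwrap it
    "required_attributes": ["contract:acme-dsa-17"],   # ABAC attributes bound to the object
    "key_id": "kas://keys/acme-dsa-17",                # placeholder key reference
    "expires": dt.date(2025, 6, 30).isoformat(),
}

def may_unwrap(caller_domain: str, caller_attributes: set[str], today: dt.date) -> bool:
    """The check a key service would run before releasing the decryption key."""
    return (
        caller_domain == MANIFEST["recipient_domain"]
        and set(MANIFEST["required_attributes"]) <= caller_attributes
        and today.isoformat() <= MANIFEST["expires"]
    )

# may_unwrap("partner.example.org", {"contract:acme-dsa-17"}, dt.date(2025, 1, 15)) -> True
```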
[00:44:19] Tobias Macey:
Another interesting approach that I've seen recently with the DuckLake table format is they've introduced the capability of being able to handle some of that row level access control automatically because you can set the key that you want to use as your filter effectively as the partition key, and they will automatically handle partitioning the data based on that value in that field and automatically apply separate encryption keys to that data so that the only way you actually can read any of it is if you've been granted access to read that partition and retrieve the decryption key to be able to read the data, where maybe you can get the files out of s three, but it's gonna be gobbledygook because you don't have the decryption keys necessary.
[00:45:07] Matt Topper:
Yeah. That's a smart approach as long as you are very good at managing those keys. Right? And that's where a lot of organizations fall down. Similarly, there was an offshoot of Hadoop years ago termed Accumulo. And when they first started building Accumulo, they would embed the policies in the row. Basically, it's cell level security, and they would embed those policies at the cell level. Well, you can imagine, as you're trying to query across all of those tables, it was a nightmare from a performance perspective when you start locking things down. So we started moving to tagging at the cell level, basically a UUID, and then the policies were abstracted outside of it. That's the same as what you're saying. I haven't looked at the DuckLake side yet; that's gonna be a late night project for me tonight to see how they're doing it. But, right, evaluate those policies on the outside and then return the unique labels of the UUIDs.
And now that makes it a very easy join to determine what can and can't be released out. But adding that extra encryption layer that you said they're doing, yeah, I'm gonna dig into it, because I've seen the issues and problems, and, hopefully, they've learned from the past, or maybe I'm gonna put in a PR later tonight to give them some heads up of what's coming.
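This is not DuckLake's actual implementation, just a minimal Python illustration of the idea: rows are grouped by a partition value and each partition is encrypted under its own key, so reading a partition requires being granted that specific key.

```python
from collections import defaultdict
from cryptography.fernet import Fernet

def write_partitions(rows: list[dict], partition_col: str):
    """Group rows by the partition value and encrypt each partition under its own key."""
    partitions, keys, encrypted = defaultdict(list), {}, {}
    for row in rows:
        partitions[row[partition_col]].append(row)
    for value, part_rows in partitions.items():
        keys[value] = Fernet.generate_key()                  # one key per partition value
        payload = "\n".join(str(r) for r in part_rows).encode()
        encrypted[value] = Fernet(keys[value]).encrypt(payload)
    return encrypted, keys  # keys would go to a key manager with per-tenant grants

def read_partition(encrypted: dict, granted_keys: dict, value):
    """Only callers holding the key for this partition value can decrypt it."""
    return Fernet(granted_keys[value]).decrypt(encrypted[value])

# files, keys = write_partitions([{"tenant": "a", "amt": 10}, {"tenant": "b", "amt": 7}], "tenant")
# read_partition(files, {"a": keys["a"]}, "a")   # works: caller was granted tenant a's key
# read_partition(files, {"a": keys["a"]}, "b")   # KeyError: no key, the bytes stay gobbledygook
```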
[00:46:28] Tobias Macey:
Yeah. So they've done a very good job from what I can tell. But for teams who are trying to tackle these challenges of identity and access management, we've just enumerated all of the ways that they're doing things wrong, and it seems like a huge lift. So I'm just wondering, what are some of the easy lifts or low hanging fruit that teams should be thinking about as they continue to battle this constantly changing ecosystem of how do I make sure that the data I'm responsible for is being accessed appropriately? And maybe that's just a matter of making sure that you have a centralized location for audit logging, but I'm just wondering what are some of the concrete steps that you generally advise teams make sure that they have in place even before they get into some of these more esoteric and complex solutions like OpenTDF?
[00:47:18] Matt Topper:
Yeah. So I think the biggest, actionable thing I would suggest for organizations to move forward is to take a look at every single service account that you've authorized to access your data and do an audit of what they can and can't access based on those service account roles and policies that you've put in place today, and really understand why they're querying your data and for what information. And I've seen way too many times where it's just like they need access to all of the data for all of the things, and they've got read, write on pretty much every database table or every file within the s three bucket.
That's just begging for data exfiltration. And take a look, with those service accounts, do you have policies in place such that you know what normal looks like for those queries from those accounts? Right? Do those queries 99% of the time come in as a lookup for an individual organization, an individual person, and then they return maybe, call it, a thousand rows or a thousand items? Would you ever be able to detect if they returned a million instead? Right? And a lot of times that gets thrown over with hopes and prayers to a SOC team that, right, we all know, and it's nothing against the SOC teams, just has too much data being thrown at them that they don't understand the context of. So, right, they're only seeing five alarm fires, and if every row in your database is being returned, it's too late. Right? The fire's already burning when they're pulling the trucks up because the stuff's out the door. So that's where I think people really could get started. And once they start understanding, here are the service accounts, here are the rotations on the service accounts, so what type of credentials are we using?
How often are they being rotated? Then, if they're being rotated, say, even on a thirty day cycle, what do the normal patterns look like just from a broad perspective? What are the normal requests and responses? Can we tell if something falls outside of that? Right. Now you're at a point where you can start getting into some of these more fine grained controls, but I will say most organizations we work with don't even have those coarse level controls, or the ability to monitor and audit those coarse level controls happening.
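A minimal sketch of the kind of baseline check described here: per service account, compare a query's returned row count against recent history and flag large deviations. The account names and thresholds are illustrative; real detection would also consider the tables touched, time of day, and source workload.

```python
from statistics import mean, stdev

# Recent row counts returned per service account (illustrative seed data).
HISTORY: dict[str, list[int]] = {"svc-reporting": [950, 1010, 870, 1120, 990]}

def is_anomalous(account: str, rows_returned: int, sigmas: float = 5.0) -> bool:
    """Flag a result set wildly larger than this account's normal responses."""
    history = HISTORY.get(account, [])
    if len(history) < 5:
        return False  # not enough history to judge; keep logging and build the baseline
    mu, sd = mean(history), stdev(history)
    return rows_returned > mu + sigmas * max(sd, 1.0)

def record(account: str, rows_returned: int) -> None:
    """Fold each observed query back into the rolling baseline."""
    HISTORY.setdefault(account, []).append(rows_returned)

# is_anomalous("svc-reporting", 1_000_000) -> True: a full-table dump stands out against
# an account that normally returns on the order of a thousand rows per query.
```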
[00:49:47] Tobias Macey:
I think one of the other challenges when you start talking about credential rotation is that the systems that are using those credentials maybe don't have any concept of rotating credentials where you're expected to just put the username and password into a text box somewhere and then save it. And maybe it encrypts the credentials in its storage layer, but there's no way for you to be able to say, hey. Whenever you need a credential, call out to this other thing. Instead, you have to do the integration work of pushing the credentials into that system if it even has a way of doing it in an automated fashion, and maybe you just have to do it manually. And that's definitely a challenge that I've encountered in some of the work that I do where for some of my systems, I could say, hey, I use HashiCorp Vault for dynamic credentials, so I'm just gonna use the Vault secrets operator in my Kubernetes cluster to fetch the credentials and keep them up to date. But maybe the system that I'm integrating with or using to do some of the data ingest doesn't have any concept of being able to pull from an environment variable. It has to go into a text field somewhere. And so now I have to build that API bridge to push those credentials into the connection configuration.
[00:50:56] Matt Topper:
Yeah. I've seen a lot of that as well. So the first step I see a lot of is, right, they don't even understand how things work with the products they're using a lot of times, and they just know that if they put the username and password that the data team provides in this field, it's gonna work. And they don't know what pieces and parts they can rotate and how. And, right, again, you gotta have some backbone from the leadership teams of saying, no, you're gonna spend the time on this, and you're gonna do it the right way. And, yeah, it's gonna take two weeks to write a script, test, and validate that script to rotate those credentials, but that's good for the team anyways.
And, right, sometimes there are tools that don't do that, and what I've found successful there is, okay, you can embed that username and password, but we're gonna take it out of what's, right, a traditional layer seven type call and force it to a layer three or layer four call with a certificate. So that every time, right, I'm old, a JDBC pool opens, it's gonna pull a certificate from the local store that is automatically being rotated outside of what the normal developers are seeing. And then every thread on that connection is using that to authenticate at layer three, layer four to your database endpoint on the other side. And then that credential can at least be rotated on a regular basis, which takes it out of the developer's hands. And, like, those things are, right, literally a built in function of something like a HashiCorp Vault or an OpenBao if you're on the fork.
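A hedged sketch of pulling dynamic database credentials from HashiCorp Vault (OpenBao exposes the same API) instead of embedding a static username and password; the mount path, role name, and server address are assumptions for the example.

```python
import hvac  # HashiCorp Vault / OpenBao client library

# Authentication (token, Kubernetes auth, certificates, ...) is assumed to be
# configured for this client already.
client = hvac.Client(url="https://vault.internal:8200")

def fetch_db_credentials(role: str = "analytics-readonly") -> dict:
    """Each call returns freshly minted credentials with their own lease, so
    rotation happens by simply letting the lease expire and fetching again."""
    secret = client.read(f"database/creds/{role}")   # assumed database secrets engine mount
    return {
        "username": secret["data"]["username"],
        "password": secret["data"]["password"],
        "lease_seconds": secret["lease_duration"],
    }

# creds = fetch_db_credentials()
# A connection pool would refresh its credentials before creds["lease_seconds"] elapses,
# which is what tooling like the Vault Secrets Operator automates for Kubernetes workloads.
```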
[00:52:39] Tobias Macey:
So as you have been working in this ecosystem, as you engage with teams who are managing the data platform and data infrastructure, what are some of the most interesting or innovative or unexpected ways that you've seen them address some of these challenges of identity, credential, and access management?
[00:52:57] Matt Topper:
I will say the Kubernetes push and some of the things that came out of Google, with the rotation of credentials and the chaining of identities as they move through the system. Right? As you can imagine, Zanzibar was built for the problem of, I wrote a Google Doc, how and with whom am I gonna share it? And those of us that use Google Docs know that sharing window: you've got people in your organization, people outside your organization, anyone with a Gmail address globally, anyone with any address globally. You've tried to put that into groups. You then map that to roles inside. Right? It's super complex. And, right, Zanzibar is great, as I said, but it takes some engineering. The true data nerd in me loves reading the Zanzibar paper, how it was implemented, and how it decouples but also tightly enforces.
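For readers unfamiliar with the model, here is a toy illustration of the Zanzibar relation-tuple idea: access is a set of (object, relation, subject) tuples, and a check walks indirect subjects such as group memberships. This is a teaching sketch under those assumptions, not Google's implementation or any particular product's API.

```python
# Toy Zanzibar-style check: can this user exercise this relation on this object?
TUPLES = {
    ("doc:design-review", "viewer", "group:data-eng#member"),
    ("group:data-eng", "member", "user:alice"),
    ("doc:design-review", "owner", "user:bob"),
}

def check(obj, relation, user, depth=5):
    if depth == 0:
        return False
    for o, r, subject in TUPLES:
        if o != obj or r != relation:
            continue
        if subject == user:
            return True
        # Indirect subject like "group:data-eng#member": recurse into that set.
        if "#" in subject:
            sub_obj, sub_rel = subject.split("#", 1)
            if check(sub_obj, sub_rel, user, depth - 1):
                return True
    return False

print(check("doc:design-review", "viewer", "user:alice"))  # True, via group membership
print(check("doc:design-review", "viewer", "user:carol"))  # False
```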
So from an interesting perspective, that whole chain of certificates and applying the policies is awesome. But from a simpler perspective, a lot of times it's just simple database proxies. If your database doesn't have the ability to do filtering inside of it, or to really differentiate credentials coming at it from different sources in an easy way, a lot of the database proxies out there will do that layer for you. They give you a way to abstract away which username and password is connecting, or which certificate is being used, and then modify the where clauses.
Say, oh, for these tables, you're gonna stripe it by customer ID. You'll find a lot of those patterns in the SaaS software world, where organizations have giant databases serving thousands and thousands of customers and use those proxies to limit them. It's a very easy way to get things like row-level and column-level security out of traditional tools that sometimes won't support it.
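Here is a highly simplified sketch of that proxy pattern: look up which tenant the connecting credential belongs to and add a striping predicate before the query reaches the database. Real proxies do this with a proper SQL parser and per-table policies; the naive string rewrite, table names, and tenant mapping below are only to show the idea.

```python
# Sketch: rewrite incoming SQL to stripe striped tables by the tenant that
# the connecting identity is mapped to. Illustrative only; a real proxy
# would parse the SQL rather than appending text.
TENANT_BY_IDENTITY = {
    "svc-acme-reporting": "acme",
    "svc-globex-reporting": "globex",
}

STRIPED_TABLES = {"orders", "invoices"}

def rewrite(identity, sql):
    tenant = TENANT_BY_IDENTITY[identity]
    lowered = sql.lower()
    if not any(table in lowered for table in STRIPED_TABLES):
        return sql
    predicate = f"customer_id = '{tenant}'"
    if " where " in lowered:
        return sql + f" AND {predicate}"
    return sql + f" WHERE {predicate}"

print(rewrite("svc-acme-reporting", "SELECT * FROM orders"))
# SELECT * FROM orders WHERE customer_id = 'acme'
```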
[00:55:05] Tobias Macey:
And in your own experience of working in this space, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:55:16] Matt Topper:
The creative ways that developers will try to get around your controls and join your data. Right? They'll pull as much data as they can into memory and join it from places you didn't expect it to be joined. And, really, there's no solving it unless you're looking at the queries that are coming in and saying, there's no way you're ever gonna return more than a hundred rows on that query if you're doing this right. And we've all seen it. Right? Devs will dev, and devs will figure out a way to get the requirement fulfilled despite our best intentions.
So, really, it's just trying to look at that audit data and having the controls in place to protect against it. That's it, sadly.
[00:56:00] Tobias Macey:
As you continue to work in this space and work with teams, particularly in the context of data systems, what are some of the developments or new technologies or protocols that you are keeping a close eye on, or any of the hopes and predictions that you have for the future of access management in the context of data?
[00:56:25] Matt Topper:
So as much as AI promises to solve everything in the world and put us all out of jobs, right, I think Larry Ellison told us twenty five years ago that we didn't need DBAs anymore, and we see how well that's gone. I do think there is a lot of opportunity with AI to solve a lot of the policy problems. Right? We didn't even address the challenge that these policies are normally written in documents, literally PDF files from some regulatory group, that then have to be translated through your own corporate legal teams, and then somehow hopefully get translated into actual policies that a developer can implement. I think we'll see better ways of doing that with AI more quickly, and as those policies change, being able to say, here's an OPA policy set for HIPAA, here's an OPA policy set for NIST 800-53 compliance, or, right, we'll just say zero trust because we haven't done that yet. But, right, there is a zero trust data tagging format, which is actually a thing built off of OpenTDF that NATO uses, where they define, okay, here are all the data tags for the policy.
Here are all the data tags that translate to policies. So I think watching that space is going to be interesting. There was a project in the identity space that was trying to create a single policy administration point, I believe it was called Hexa, where there's one administration point, but it would put out the policies in XACML and Rego and Cedar so that they could consistently be applied across systems and services. So the rise of data protection catalogs that can then be applied to datasets is something I'm hoping finally catches on and actually becomes a thing. We've talked about it in a lot of different ways over the last twenty, thirty years, but no one wants to write those policies. It's tedious, and I think if we can build some tools around it and use some of the AI capabilities to do that, we should be more successful.
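As a reader aside, the externalized-policy pattern mentioned here can be as simple as calling OPA's REST data API from whatever system needs a decision, so the same rule set is evaluated by pipelines, services, and proxies alike. In this sketch the policy package path and input fields are made-up examples, and it assumes an OPA instance is already running with such a policy loaded.

```python
# Sketch: ask a running OPA server for an allow/deny decision over its
# standard data API. The policy path and input shape are illustrative.
import requests

OPA_URL = "http://localhost:8181/v1/data/datasets/phi/allow"

def is_allowed(user, purpose, dataset_labels):
    payload = {
        "input": {
            "user": user,                 # e.g. {"id": "alice", "roles": ["analyst"]}
            "purpose": purpose,           # e.g. "treatment" vs "marketing"
            "labels": dataset_labels,     # e.g. ["phi", "restricted"]
        }
    }
    resp = requests.post(OPA_URL, json=payload, timeout=5)
    resp.raise_for_status()
    return resp.json().get("result", False)

if __name__ == "__main__":
    print(is_allowed({"id": "alice", "roles": ["analyst"]}, "treatment", ["phi"]))
```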
[00:58:38] Tobias Macey:
Are there any other aspects of this overall space of identity and access management and credential management in the context of data systems that we didn't discuss yet that you'd like to cover before we close out the show?
[00:58:51] Matt Topper:
The only thing I'd like to say is please reach out to your identity and access management teams. We keep making our own data lakes with stale data, pulling from sources that you all have much better access to and have already cleaned up in your data lakes and data warehouses, which should be centrally managed. We need things like HR records. We need things like contract information, so that when we onboard a new user from an outside organization who's gonna come work for you on a temporary basis, we know when that contract is supposed to expire and can turn off their access on the last day, rather than hoping and praying that the person running the team remembers to go into the system and turn it off. Same thing with training data. All of that information, I see most identity teams replicating on their own.
And a lot of times they'll fall back and say, well, it's in LDAP, and you guys don't speak LDAP. That's cool. Right? Please reach out to the identity teams. Please help them see the power of data, and of clean data, and enable them. This is always my hook when I have to get the data teams to go talk to our customers' identity teams: hey, you know what the identity teams hate? They hate getting blamed when someone can't log in to anything, but it's because of the bad data they got from the source systems. So if you can tell them, hey, we've got this data catalog and this data warehouse that cleans all of this up, and here's the registry, then when the position code is wrong and they make a bad authorization decision and get blamed for it, they can put up on their site, hey, here's what we know about you, and if any of this data looks wrong to you, here's the help desk or the email address to go get it fixed. Don't blame the identity team. They'll love you to death. Right? As silly and stupid as that is, they're building those systems in their own silos, and it's just slowing down innovation for all of us. And, hopefully, at the same time we'll be able to give some things back to the data teams: hey, we've got these policy engines, we've got these roles defined.
Can you reuse them at the data layer? Or, hey, we're happy to work with the application teams every day to get their roles and groups managed and integrated with single sign-on. Is there someone on your side we can turn them over to, to make sure they understand how to access the data securely? Just try to build those relationships, because we're all trying to solve the rogue-developer-getting-things-out-the-door problem from different ends of the application. If we work better together, I think we're gonna see much better results for the companies we work for.
[01:01:45] Tobias Macey:
Yeah. I absolutely agree on all of those points, particularly as somebody who is responsible for both the infrastructure and the data layer. So I'm at least in the privileged position of being the person making the decisions on both sides, so we can have a little bit tighter coordination. Well, thank you for having me today. Of course. And for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:02:12] Matt Topper:
What I'm seeing a lot of is that the gap isn't in tools. It's not in storage. It's not in speed. It's in trust composition. Right? And we can say zero trust is overloaded as shit, and I already alluded to my thoughts on that term. But we're still missing this layer. Right? We've got catalogs for data. We've got pipelines for transformation. We've got dashboards for insight. But there's still this missing layer of trust: where did that data come from, how is it allowed to be given back to people, and then from the other end, who are the users, have we attested them, have we attested those services, and how do we know the chain they took to get to the data layer?
And that to me is the biggest gap. If we can really start identifying what that trust composition is from both sides, and that obviously comes with wrapping the metadata around those things, I think some of these policy problems and decisions get easier. So that's my thought. It's a big problem to solve, but the good news is data folks tend to like big problems and take those challenges on. And please reach out to us; I'm more than happy to provide some links to where the identity community tends to hide online. Please reach in. We're trying to solve the same problems, and we need to come together.
[01:03:41] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share all of your thoughts and experience on this broad and constantly shifting problem of managing who can do what with which data, when, and why. It's definitely a very complex problem to solve, and we've got some point solutions, but it's far from being a holistically managed ecosystem. So I appreciate all of the work that you're doing to help mitigate some of those complexities, and I hope you enjoy the rest of your day.
[01:04:10] Matt Topper:
Thanks. You as well. Take care.
[01:04:20] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to guest and topic: identity in data systems
Matt Topper’s background and early data security lessons
Why data platforms complicate identity and access control
Chaining identity with JWTs and policy enforcement patterns
OPA/Rego proxies and centralizing fine‑grained policies
Identity basics: IDPs, tokens, and end‑to‑end auth flows
Beyond databases: files, buckets, and masking challenges
Policy languages: XACML, Rego, and Cedar in practice
Vendor lock‑in, procurement leverage, and abstraction layers
Finding PII and propagating labels via catalogs and lineage
OpenTDF and cryptographic ABAC for data objects
OpenFGA, Zanzibar trade‑offs, and retagging costs
Organizational buy‑in: governance, brown bags, and guardrails
Machine‑to‑machine access: short‑lived creds and attestation
Selecting enforcement choke points across interfaces
Streaming and the need to carry identity as metadata
Partition‑key encryption, key management, and cell tags
Practical first steps: auditing service accounts and logs
Working around legacy tools: certs, proxies, and rotation
Lessons learned and innovative patterns from the field
What’s next: AI for policy, unified administration, and catalogs
Collaboration between identity and data teams
Biggest gap today: trust composition layer
Closing thoughts and episode wrap‑up