Summary
Any time that you are storing data about people there are a number of privacy and security considerations that come with it. Privacy engineering is a growing field in data management that focuses on how to protect attributes of personal data so that the containing datasets can be shared safely. In this episode Gretel co-founder and CTO John Myers explains how they are building tools for data engineers and analysts to incorporate privacy engineering techniques into their workflows and validate the safety of their data against re-identification attacks.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
- Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world's first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Your host is Tobias Macey and today I’m interviewing John Myers about privacy engineering and use cases for synthetic data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Gretel is and the story behind it?
- How do you define "privacy engineering"?
- In an organization or data team, who is typically responsible for privacy engineering?
- How would you characterize the current state of the art and adoption for privacy engineering?
- Who are the target users of Gretel and how does that inform the features and design of the product?
- What are the stages of the data lifecycle where Gretel is used?
- Can you describe a typical workflow for integrating Gretel into data pipelines for business analytics or ML model training?
- How is the Gretel platform implemented?
- How have the design and goals of the system changed or evolved since you started working on it?
- What are some of the nuances of synthetic data generation or masking that data engineers/data analysts need to be aware of as they start using Gretel?
- What are the most interesting, innovative, or unexpected ways that you have seen Gretel used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Gretel?
- When is Gretel the wrong choice?
- What do you have planned for the future of Gretel?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Gretel
- Privacy Engineering
- Weights and Biases
- Red Team/Blue Team
- Generative Adversarial Network
- Capture The Flag in application security
- CVE == Common Vulnerabilities and Exposures
- Machine Learning Cold Start Problem
- Faker
- Mockaroo
- Kaggle
- Sentry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer friendly data catalog for the modern data stack. Open source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy to consume SaaS product, which has been adopted by several companies. Sign up for the SaaS today at dataengineeringpodcast.com/acryl. That's A-C-R-Y-L. Your host is Tobias Macey. And today, I'm interviewing John Myers about privacy engineering and use cases for synthetic data and the work that he's doing at Gretel to support that. So, John, can you start by introducing yourself? Hi. My name is John Myers. I'm one of the cofounders and CTO here at Gretel AI.
[00:01:42] Unknown:
And do you remember how you first got involved in the area of working with data? Yeah. You know, my background is largely based on about 12 years in the Air Force. I started off in communications engineering, working on communication systems for space launch systems, and then I switched over into cybersecurity. And, really, my time in cybersecurity kind of necessitated working with, like, massive amounts of data. And then as I progressed into my commercial career, I built a lot of products that were, you know, founded on data, basically data-centric products. And then that kind of evolved into the need to apply different privacy preserving techniques to that data, so you could guarantee that, you know, the products themselves weren't leaking any sensitive information. That was kind of the transition I made into, you know, where I'm at today.
[00:02:30] Unknown:
In terms of the work that you're doing at Gretel, can you give a bit of an overview about what it is that you're building there and some of the story behind how you settled on that problem and why you wanted to spend your time and energy on it?
[00:02:41] Unknown:
The creation of Gretel was really based off of the combined experiences myself and my other cofounders had dealing with data and privacy. And we all have kind of a bit of a different way that we backed into it. But for about 18 months prior to launching the company, you know, we spent a lot of nights and weekends kind of talking about the challenges we had working with data, being more productive with data, being able to innovate with data. And we also looked at, you know, other industries out there. Think about, like, finance, code sharing.
Right? So if you look at, like, what Stripe did for payments and you look at what GitHub did for collaborating with code, they made them engineering problems. And they solved them by providing toolkits and really clean APIs for developers to more quickly work with that. And we said, why can't we boil privacy down to an engineering problem and be able to provide a toolkit and APIs for developers to more quickly create safe versions of data through a variety of means. That's kind of like when a light bulb went on for us. We said, let's kind of go down that route and really go after the developer ecosystem and give them a toolkit they can use to implement a variety of privacy engineering techniques.
[00:03:58] Unknown:
The term privacy engineering, I'm wondering if you can give your definition of it and what you see as being the kind of main scope and any particular areas of it that you're focusing on with Gretel.
[00:04:11] Unknown:
You know, I think at a high level, I view privacy engineering as basically a derivative of software engineering, data science, and data engineering that is really focused on the techniques and methodologies to take data and make it safer for consumption by a broader audience. And specifically at Gretel, what we're focused on is three large buckets of capabilities around privacy engineering. The first one being synthetic data, which is the ability to take data, train a machine learning model on it, and then create a version of that data that looks and feels like the real world data, that is virtually indistinguishable, but is, you know, safe from a privacy perspective.
The other two buckets we have are what we would call data labeling or data classification, where you can quickly analyze a dataset and determine any type of sensitive information that's inside of it. And then, moving on from that, the ability to apply different anonymization or de-identification techniques through transformations to that data. Those tend to be more discrete, more like the familiar ETL processes that you might have, you know, worked with in the past if you were previously a data engineer.
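To make the data labeling bucket a little more concrete, here is a minimal, regex-based sketch of the kind of output a classification pass produces. This is not Gretel's API; the pattern set, function name, and sample rows are illustrative stand-ins for an ML-based classifier.

```python
import re

# Regex-based stand-in for a data classification pass: flag columns whose
# values look like common identifiers.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?1?[-. ]?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
}

def classify_columns(records):
    """Return {column_name: label} for columns whose values look sensitive."""
    labels = {}
    for record in records:
        for column, value in record.items():
            for label, pattern in PATTERNS.items():
                if isinstance(value, str) and pattern.match(value):
                    labels[column] = label
    return labels

rows = [
    {"name": "Ada", "contact": "ada@example.com", "ssn": "123-45-6789"},
    {"name": "Grace", "contact": "grace@example.com", "ssn": "987-65-4321"},
]
print(classify_columns(rows))  # {'contact': 'email', 'ssn': 'us_ssn'}
```

The labels produced by a step like this are what the downstream transform or synthesis configuration acts on.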
[00:05:26] Unknown:
In the scope of an organization or a data team, who do you typically see as being responsible for privacy engineering or taking point on that issue?
[00:05:35] Unknown:
I think it really varies by the organization. I have seen it be owned by the data owners themselves. If you think about, like, a large database that holds data, and you have database administrators and you have some software engineers that closely work with it, oftentimes they are chartered with being able to create a safe version of that data when requested. I have seen organizations that have a dedicated shop that is responsible for going in and applying those privacy engineering techniques. And I think that kind of question, and my answer to it, somewhat highlights a bit of a problem where it's not always obvious who has that responsibility. And that's partially because there's not a lot of tools out there that enable, you know, the developers that are closest to that data to be able to implement those privacy engineering techniques. And that's one of the things that we're trying to solve here at Gretel.
[00:06:27] Unknown:
Because of the fact that there isn't necessarily a clear owner of privacy engineering as a practice, I'm curious how you see that impacting the overall adoption in organizations and industries of these privacy engineering practices, and the kind of proper attention and attribution of importance to it as a first class concern?
[00:06:51] Unknown:
Yeah. So I think the lack of, like, a central party that is responsible for privacy engineering is one of the contributing factors to this huge timeline that has to be executed to even get access to data. So when you look at, like, trying to share data from one team to another or from one organization to another, one of two things has to happen. You have to figure out how to get the data into a safe state where you can share it, which would be, you know, applying privacy engineering techniques, or you have to go through some type of approval process so that the recipient of the data is, you know, signed off to be able to have access to the sensitive information.
And both of those timelines are very long, and it really hampers innovation. And so what I think we see with our customers now is that they're interested in adopting these tools right into their software engineering and product engineering teams, so that the owner of the data, or better yet, the creator of the data, which is usually, you know, an engineering team or a data science team, has the capability to quickly apply privacy engineering techniques and then be able to just move that data exactly where it needs to go, so people aren't scrambling around trying to figure out, you know, who's responsible for actually doing it. Like, if you know who built the product, you should be able to ask them to do it, because now they'll have the tools to do so. In terms of the target users of Gretel, you mentioned that you're looking to be
[00:08:09] Unknown:
an empowering solution to make it easier for teams to be able to adopt some of these principles and practices of privacy engineering. I'm curious how you think about the different personas that will interact with the Gretel platform and how that informs the feature prioritization and the overall design of the product.
[00:08:27] Unknown:
Our target user is definitely a mix of software engineers, data scientists, machine learning engineers, and researchers. And when you look at that user base and you think about the way that those personas like to adopt tools, oftentimes, they wanna be able to kinda jump in and get started on their own without having to go through any specific onboarding process or have to, you know, get access to, like, a license, like, going through kind of more of an enterprise sales type motion. And so the way that really informs us on how we build is to create a very, very low friction experience, be able to sign up very quickly, and offer all the features of privacy engineering for free for anyone that signs up. You know, they can dive right into our console and be productive within a matter of minutes, and then we have our API documentation and our blueprints and our blogs to show folks how to get to that next level. That has been the approach we have taken. Really just like a self serve model, so people can get productive within minutes and then scale however they need.
[00:09:33] Unknown:
As far as the kind of general adoption, obviously, having tools such as Gretel makes it easier. But across the industry, I'm wondering what you see as maybe the points of friction in highlighting this as a problem that's important enough to invest in, or just the general kind of visibility and awareness of this as a problem that needs to be solved within data teams?
[00:09:58] Unknown:
I think there's a couple of contributing factors there to kind of the friction points of adopting privacy engineering. One, you know, I'll call it the status quo, or the state of the art that's out there right now, is kind of two ends of a spectrum. You'll see that there's a lot of tools that are built into cloud providers, and a lot of them are built under the umbrella of what we call data loss prevention, which has kind of been in the cybersecurity realm for several years. And so you have tools in Amazon and Google and Microsoft Azure and all the other cloud providers where they give you some privacy tools. However, you know, they're designed to work only with the services that those cloud providers have access to. So they're kind of like, if you're really using Amazon, you get some privacy capabilities there. But if you're working in a generalized software engineering environment, you can't necessarily use those cloud provider tools unless you're really inside the ecosystem.
And then on the other end of the spectrum, you have a lot of really awesome developments coming from research organizations and educational centers that are developing open source tools to enable privacy engineering. And they're all amazing and really cutting edge, but they don't come with the enterprise support or commercial support for, you know, a really large company to adopt them at scale, and they don't come with the tools you need to actually run them. Right? You have their software packages and their SDKs, so you're still responsible for building an infrastructure, figuring out how to deploy them, you know, putting a service around it. And so Gretel kinda wants to be, you know, right in the middle there. Right? Like, we're agnostic to a cloud vendor. We're agnostic to programming languages.
We're turnkey and API enabled. And so you don't have to bring your own infrastructure. You don't have to bring, you know, complex GPU systems to do the machine learning and synthetic data work. We offer it all for you. And so I think that is what is really driving folks: one, they don't wanna be locked into a certain cloud provider for privacy engineering tools. And two, they want the ability for someone else to handle the infrastructure side of it. They want the tooling to be agnostic to, you know, a very specific type of ecosystem, so they have more flexibility to bring privacy engineering into many different developer workflows.
[00:12:16] Unknown:
Are you looking for a structured and battle tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all of your questions? Join Pipeline Academy, the world's first data engineering boot camp. Learn in small groups with like-minded professionals for 9 weeks part time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus, they have Ask Me Anythings with world-class guest speakers every week. The next cohort starts in April of 2022.
Visit dataengineeringpodcast.com/academy today and apply now. In terms of those workflows, I'm wondering if you can talk through some of the different stages of the data life cycle that privacy engineering can and should be factored into, and just some of the process of being able to actually integrate Gretel into those inflection points of data as it's generated, manipulated, processed, incorporated into various analytics or machine learning models, etcetera?
[00:13:26] Unknown:
At a general level, you know, our philosophy about how to inject privacy engineering into kind of the life cycle of data is that, ideally, you can get as close to the source data as possible to execute that, because what you run into is an amplification problem when you have data. Right? As soon as you create a dataset or you populate a database, that data is gonna get exported and sent around to all sorts of different stakeholders, whether it's other engineering teams, data science teams, business teams, marketing teams. And so that data is just gonna be replicated over and over and over again. And so if you can kind of find that choke point where that data first gets created and lands in the initial, you know, call it repository or data store, just to use an abstract term, if you can apply privacy engineering at that point, then you can create a safe version of that data, and then that problem becomes less risky because you're amplifying and sharing a safe version of the data. When it comes to, like, specific Gretel implementation for that, you know, for us, it's being able to kinda scale with that data creation. So we have our core APIs, which allow you to do, you know, granular transforms on data or create a synthetic version of a dataset.
The APIs can also then be wired into an automated tool, you know, your CI/CD pipelines. We're building a suite of connectors so you can grab data when it lands in an S3 bucket and immediately create a safe version of that data and put it into another S3 bucket. And so being able to turn us on in a data pipeline style like that is the way that we typically integrate with our customers.
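As a rough illustration of that bucket-to-bucket pattern, here is a minimal sketch of a handler that could be triggered when raw data lands in S3. The `run_privacy_job` function and the destination bucket name are hypothetical placeholders for whichever transform or synthetics job you submit; only the boto3 calls are real.

```python
import boto3

s3 = boto3.client("s3")

def run_privacy_job(csv_bytes: bytes) -> bytes:
    """Placeholder for the privacy engineering step (for example, a transform
    or synthetics job submitted through a service API)."""
    raise NotImplementedError

def handle_new_object(bucket: str, key: str) -> None:
    # Triggered, for example, from an S3 event notification when raw data lands.
    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    safe = run_privacy_job(raw)
    # Write the privacy-processed copy to a separate "safe" bucket that
    # downstream consumers are allowed to read from.
    s3.put_object(Bucket="analytics-safe-data", Key=key, Body=safe)
```

The point of the pattern is that everything downstream only ever touches the safe bucket, so the amplification problem described above applies to the processed copy rather than the raw data.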
[00:14:58] Unknown:
Can you describe a bit more about how the Gretel platform itself is implemented and some of the architectural and design elements that you had to consider as you were starting to build out the platform and onboard some of your early design partners?
[00:15:13] Unknown:
I'll kinda start with the current state, and then I'll talk a little bit about how we got there. So the current state is that, under the hood, Gretel is a cloud service that is backed by REST APIs. Those REST APIs let you schedule jobs, and those jobs run, consume input data, execute the privacy engineering tasks, whether we're doing synthetic data generation, transforms, or data classification, and then they output that data. All of that can be scheduled and achieved purely with our REST APIs, and then we have different interface layers in front of it. So we have our console, which is, you know, our flagship. You log in to Gretel, you get started. That's our user interface. It uses those REST APIs for you. We have a CLI tool that you can drop into the command line, and you can start to build scripts and get more automation. And then we have language-specific SDKs that allow you to bake us into, you know, whatever software applications you have. But under the hood, you know, the big architectural decision we wanted to make was that the API itself is what drives everything, and then we can build tooling around it for different specific types of frameworks and ecosystems.
How we ended up there is really kind of our evolution through two public betas that we had before we went GA about a month ago. So in our first beta release, you know, we implemented and provided SDKs that, you know, did a lot of the synthetic data generation and transforms. And, you know, we put them out there in blueprints, and we allowed people to download those SDKs and integrate them into their workflows. However, you know, they were offered in Python. You had to put them inside of, like, Jupyter Notebooks. And so out of the gate for that first beta, we had, you know, a ton of design partners come on, and they loved it. But what we learned is we kind of isolated consumption of Gretel to a narrower audience than we wanted.
You know, as we started talking to other design partners, they were saying, like, well, hey. We're a Java shop, and we're building x, y, z in Java, or we're a Golang shop, and we're building this. And they all have different circumstances. And so when we took a step back, we said, can we add a layer of abstraction with this REST API so that everything can go through that? And then it doesn't matter what language you write in. It doesn't matter what framework you're on. You can integrate Gretel anywhere. That's how we got to where we're at today.
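To picture that API-first layering, here is a minimal sketch of scheduling a job and polling for its result over a generic REST API from Python. The base URL, paths, and payload fields are assumptions made for illustration, not Gretel's documented endpoints; a console, CLI, or SDK would wrap calls of this shape.

```python
import time
import requests

API = "https://api.example-privacy-service.com/v1"  # illustrative endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Schedule a job that trains a synthetic model on an already-uploaded dataset.
job = requests.post(
    f"{API}/jobs",
    headers=HEADERS,
    json={"task": "synthetics", "dataset_id": "ds_123", "config": "synthetics/default"},
).json()

# Poll until the job finishes, then download the generated data artifact.
while True:
    status = requests.get(f"{API}/jobs/{job['id']}", headers=HEADERS).json()
    if status["state"] in ("completed", "error"):
        break
    time.sleep(10)

if status["state"] == "completed":
    artifact = requests.get(f"{API}/jobs/{job['id']}/artifacts/data", headers=HEADERS)
    with open("synthetic.csv", "wb") as f:
        f.write(artifact.content)
```

Because every interface layer goes through the same job API, the same flow works whether it is driven from a UI, a shell script, or a CI/CD step.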
[00:17:31] Unknown:
In terms of the kind of nuance of integrating data obfuscation and synthetic data into your datasets for analysis or machine learning, or just being able to create a safe version of a dataset to share with other organizations or publicly, what are some of the kind of details and nuances that you typically see data engineers or data analysts running up against, and just some of the considerations that they need to be aware of as they start to use something like Gretel? Yeah. I think one of the biggest nuances is
[00:18:09] Unknown:
this kind of utility versus privacy, I'll call it, trade-off or relationship. So look at synthetic data. Our synthetic data capability comes with privacy capabilities baked in, and you can kind of tune the privacy settings based off of what your use case is. But let's say you put very high privacy settings on. That might impact the utility of the data, because we do certain things like remove records that are overly similar to the training data, or we remove outliers that can potentially be used to infer information about someone. And so, really, what we talk about with our customers is where is this data going and who's the consumer of the data? Let's say, for example, you are replicating a database because you need to have mock data for an engineering team, and you know that a lot of engineers are gonna consume or have access to that data. Then we would say, well, why don't you just turn the privacy level up as high as it can go? Then you'll have a reasonable amount of confidence that you can just make that data generally available to a lot of those developers, because they're really looking to just build something. Let's say you're building a user interface around that data. You just want the data to fit in the user interface, and you wanna make sure that your components and your interactions can be built the way they need to be built by having access to that data. So that's, like, one example. On the flip side, you know, let's say you are sharing data with a data science team and they need to train a machine learning model on that data.
You know, you have to kinda pay attention to the different configuration parameters to make sure that you don't lose a lot of the correlations and statistical properties of that dataset. And if you know that data isn't going to be necessarily, like, shared publicly or shared very widely, then you can make some trade-offs and preserve a lot of utility in the data and then adjust your privacy settings that way. So it really comes down to those two kind of, like, really big knobs, and there's a lot of micro adjustments in the configuration you can make to account for those use cases.
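As a rough sketch of those two big knobs, here are two illustrative configurations expressed as Python dictionaries, one biased toward wide sharing and one toward preserving utility for a trusted data science team. The key names are assumptions made for illustration, not Gretel's exact configuration schema.

```python
# Share widely: favor privacy over fidelity.
share_widely_config = {
    "privacy_filters": {
        "similarity": "high",  # drop records too close to real training rows
        "outliers": "high",    # drop rare records that could single someone out
    },
    "differential_privacy": True,
}

# Trusted data science team: relax the filters to keep correlations and
# statistical properties intact for model training.
preserve_utility_config = {
    "privacy_filters": {
        "similarity": "medium",
        "outliers": "medium",
    },
    "differential_privacy": False,
}
```

The choice between configurations like these is exactly the "where is the data going, and who consumes it" question described above.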
[00:20:15] Unknown:
And as far as being able to maintain some of that statistical relevance as you are modifying or manipulating the source datasets, what are some of the ways that you have worked in sort of strategies to maybe simplify that process for end users so that they don't have to necessarily be as familiar with the state of the art of those statistical methods to figure out what is the sort of appropriate sampling and distribution of the source datasets and the, you know, acceptable mutations that can be made and just making it as sort of plug and play as possible while giving them, you know, an appropriate level of control?
[00:20:58] Unknown:
There's kind of, like, two sides to that. So on the configuration side, before you run, like, a job in Gretel, we have some high-level knobs that you can turn. Right? We have different privacy settings, and you can turn them off, you can put them on medium, you can put them on high. And we try to kind of stick with those high-level, simple settings so people can get started very easily. And then as you kind of iterate and work with the data, you can drill down and start to, like, even get into hyperparameter tuning for some of the underlying models, which we start to see customers do when they really wanna, like, hone in on certain attributes that they want the data to have. And then on the output side, one of, like, my favorite products that we have here is this notion of a quality report, a synthetic quality report. And what that does is it looks at the original data and looks at the synthetic data and basically gives you, you know, some high-level scores about those distributions, about those statistical properties, about kind of, like, the deep structures inside of that data.
And then we'll tell you, like, hey, your score is excellent, and therefore, you know, you can use this synthetic data for machine learning. Right? Like, you can actually train a machine learning model on this, then go put that machine learning model to work making predictions on the real world data, and have the same, if not better, accuracy than you would have by just building a machine learning model on the original data. So what you can do is you can kind of run experiments and tweak different configuration parameters, and then you can look at how that score changes for you. Right? And that score will show you, like, what the privacy levels are, and it'll show you what the utility levels are. We've definitely used some other tools, like Weights & Biases, which is an awesome product we've integrated into some of our blueprints, where we can show how to do, like, parameter sweeps to be able to show how you can get the right optimizations for what your desired output is. And so you basically can just watch those scores change based off of your input. And then once you kinda find that sweet spot of what you want, that's where we'll say, okay, now you can kind of start to automate us into your workflow, because you know what those parameters are, and you can turn us on in kind of like an always-on state to process your data.
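Here is a minimal sketch of the kind of parameter sweep described above, logging the report scores to Weights & Biases so you can watch them move across configurations. The `train_and_score` helper and the config keys are hypothetical stand-ins for whatever job submission and quality-report retrieval you use; only the wandb calls are real.

```python
import wandb

def train_and_score(config: dict) -> dict:
    """Hypothetical helper: submit a synthetics job with `config` and return
    the quality-report scores, e.g. {"quality": ..., "privacy": ...}."""
    raise NotImplementedError

# Sweep a couple of high-level knobs and log how the report scores respond,
# to find the privacy/utility sweet spot before automating the pipeline.
for similarity in ("medium", "high"):
    for outliers in ("medium", "high"):
        config = {"privacy_filters": {"similarity": similarity, "outliers": outliers}}
        run = wandb.init(project="synthetic-data-sweep", config=config, reinit=True)
        scores = train_and_score(config)
        wandb.log({"quality_score": scores["quality"], "privacy_level": scores["privacy"]})
        run.finish()
```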
[00:23:03] Unknown:
For the machine learning use case, what have you seen as the kind of utility of just using purely synthetic training datasets rather than having to work from the point of having collected a set of data and just using purely synthetic data as a way to kind of bootstrap your ML operations in the case that maybe you're a new business or you're trying to develop a new model or product line, and you don't necessarily have enough raw information to be able to actually do anything
[00:23:33] Unknown:
substantial on the ML front? That's, like, probably one of the most innovative things that we see our customers doing: kind of solving that, you know, quote, unquote, cold start problem, where they need data and they just don't have any to start with. Or they have very little data, and they wanna amplify that so they can actually have enough to test a variety of machine learning models. So a lot of times, we have customers coming to us with, like, some sample data. They train the model on it, and they use that model to generate, you know, 10x, 20x, 30x the amount, so they can actually experiment with different machine learning techniques for their use case. And sometimes they do have enough data, but, you know, what they wanna do is, it's like team one has data, and they need to get it over to team two so they can train a machine learning model on it. Really, that's like a privacy gating problem. Right? I just can't give you access to the sensitive information, and maybe it's a third party data science team. So you can create a synthetic version of that data.
And then what you can do is just give that synthetic version to that team, let them train a machine learning model on it, and bring it back in. So that's, like, the second use case. Right? The first one, the cold start problem. The second one, like, the privacy sharing problem. And the third one, which is really neat, is that sometimes you just don't have enough of certain data within your data. Right? So say you're predicting fraud, or, we have a ton of medical use cases here, where there's just not enough of a certain data class that you're trying to predict on. You could actually boost certain types of data within our synthetic models. So we call that the conditioning or smart seeding capability, where you can train the model and you can say, hey, I only want you to synthesize records that have these partial values, and then we'll synthesize the rest of the record for you, so you can actually boost or balance your dataset.
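Here is a minimal sketch of what that smart seeding pattern looks like from the caller's side, assuming a hypothetical `generate_synthetic` helper rather than Gretel's actual API: you fix the partial values you care about and let a trained model fill in the rest of each record.

```python
import pandas as pd

def generate_synthetic(model, seeds: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for conditional ("smart seeded") generation: for
    each partial row in `seeds`, a trained model fills in the remaining
    columns so the output record is fully synthetic."""
    raise NotImplementedError

# Suppose fraud is the rare class we need more examples of: seed 5,000 partial
# records fixed to is_fraud=1 and let the model synthesize every other field.
seeds = pd.DataFrame({"is_fraud": [1] * 5000})
boosted = generate_synthetic(model=None, seeds=seeds)  # model=None is a placeholder
```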
[00:25:19] Unknown:
In terms of the machine learning use case, another aspect of this that's interesting to explore is the question of bias within the generated datasets, because that's always a problem when working with machine learning or analytics: are you incorporating any sort of statistical biases into your results based on the data collection that you've done? And I'm curious how you see people approaching that problem and potentially mitigating it when they're working with synthetically generated datasets, whether that's in terms of the types of demographic information that you're producing, so that you have an appropriate range of distribution of ages or locations, etcetera, and just that overall question of artificial bias in synthetic datasets?
[00:26:06] Unknown:
That is one of the biggest use cases we see for the third thing I just mentioned, where you can use our abilities to kind of define part of the data that you wanna see generated. So you wanna balance out, you know, let's say, an imbalance in gender or race or ethnicity or ages, and then that'll help your machine learning models actually perform better. We had a blog post previously where we took an open dataset about, it was, like, heart disease prediction, and there were twice as many male records as there were female records. And so that naturally added bias to the machine learning model, because it was much better at predicting male heart disease than female heart disease.
And so we went through and we synthesized more female records. We balanced out the dataset, and then what we found is that we had better performance across the board on both male prediction and female prediction in that use case. So we did that in our beta 1, and we're actually re-releasing that with our new APIs here shortly.
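A minimal sketch of that rebalancing idea, using a toy pandas frame in place of the heart disease dataset: count the gap between groups, then build partial seed records for the under-represented group to hand to a conditional synthesis step like the one sketched earlier. The data and column names are made up for illustration.

```python
import pandas as pd

# Toy stand-in: twice as many male rows as female, mirroring the example above.
df = pd.DataFrame({
    "sex": ["male"] * 200 + ["female"] * 100,
    "label": [0, 1] * 150,
})

counts = df["sex"].value_counts()
deficit = int(counts.max() - counts.min())  # how many minority rows are missing
minority = counts.idxmin()                  # which group to boost ("female" here)

# Partial records for conditional synthesis: fix the under-represented
# attribute and let the model generate all remaining fields.
seeds = pd.DataFrame({"sex": [minority] * deficit})
print(f"Synthesize {deficit} additional '{minority}' records to balance the dataset.")
```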
[00:27:04] Unknown:
The other interesting element of being able to synthesize or obfuscate data is the question of being able to proactively identify potentially sensitive attributes in a given record or set of records and how you have built up your heuristics or models to be able to help your customers understand which attributes they do want to modify or mutate in their privacy engineering workflows.
[00:27:34] Unknown:
When I talk about, like, privacy risks, there are, for lack of a better term, what I'll call discrete privacy risks. Like you mentioned, there's just identifiers that are very obvious. Right? Social Security numbers, email addresses, phone numbers, addresses. And then there are these aggregate risks that can be used to essentially re-identify someone. Right? There are plenty of studies out there that show, like, you know, a ZIP code plus an age, you know, could be re-identifiable, you know, down to, like, just one or two people in a large dataset. And so a lot of the things that we've incorporated will help identify some of those risks and let you mitigate them. You know, at the synthetic data level, that is very good for solving that, because it will create records that, you know, can't be re-identified back to individual users, because those values are synthetic and generated and, like, the distributions and the correlations still hold, but we're not replaying any of those attributes at a record level. Right? So, you know, just for argument's sake, I'm the only one in Philadelphia, like, that's in our company.
And that also shows up as an outlier in the data, too. And sometimes you just have to remove those. And so our synthetic data generation capability will also remove some of those outliers, because, reasonably, even if you synthesize a bunch of data about someone in Philadelphia, it's still gonna be me. So we'll actually remove those from the dataset. And then sometimes there's enough samples in an area where you just wanna make sure that your ages are modified enough or your, whatever, your specific locations are modified enough. And so synthetics takes care of a lot of that automatically.
Sometimes you need a more granular approach to do that. And so what we do in our transform capability is we actually have the ability to, you know, abstract ages into, maybe, buckets of ranges or put dates into buckets of dates. And so you still get value out of the dataset. It's just that you remove the discrete identifiers there. So, yeah, those re-identification risks are, you know, some of the ones that are a lot more subtle. And so those are the ones that a lot of our customers really want to have us take care of for them.
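To make that generalization step concrete, here is a minimal pandas sketch of bucketing quasi-identifiers the way the transform approach describes: ages into ranges, ZIP codes truncated, and dates coarsened to the month. The columns and bin edges are invented for illustration, not a Gretel transform configuration.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 37, 41, 68],
    "zip": ["19104", "19103", "02139", "94103"],
    "visit_date": pd.to_datetime(["2022-01-03", "2022-01-17", "2022-02-02", "2022-03-09"]),
})

# Generalize quasi-identifiers so combinations like ZIP + exact age can't
# single someone out.
df["age_range"] = pd.cut(df["age"], bins=[0, 30, 45, 60, 120],
                         labels=["0-30", "31-45", "46-60", "60+"])
df["zip3"] = df["zip"].str[:3] + "XX"
df["visit_month"] = df["visit_date"].dt.to_period("M").astype(str)

safe = df[["age_range", "zip3", "visit_month"]]
print(safe)
```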
[00:29:46] Unknown:
There are definitely some parallels between privacy engineering in the data space and security engineering in the application development and infrastructure space. One of the practices that's common there is the idea of doing, you know, game days, where you have red teams and blue teams, where one team is trying to attack a piece of infrastructure or an application and the other team is trying to defend against it. And I'm curious whether Gretel is useful for that kind of attack-oriented planning, being able to maybe use generative adversarial networks to automate some of these re-identification techniques and figure out how well you're able to defend against them with these synthetic datasets, or just that overall practice of sort of defense in depth or offensive tactics in the privacy engineering space? That's a really cool corollary to kind of, like, the red team, blue team approach in cybersecurity.
[00:30:37] Unknown:
You know, internally, the ML team here has used quite a few frameworks to kind of, you know, quote, unquote, attack data, and then we have used those to help actually influence how we build some of the filters that we put in place to then mitigate those attacks. So from a product development standpoint, we'll take a lot of sample datasets, run various adversarial techniques against them, you know, determine that there are ways to re-identify stuff, and then, you know, we have built our filters to remove that risk, rerun the attack, and show that that risk is mitigated. We're not necessarily building those attack frameworks for public consumption, but we definitely use them as part of our, like, product engineering process.
[00:31:17] Unknown:
It's definitely interesting to see how some of those overall security practices translate over, and what are the cases where there's actually some more specific
[00:31:32] Unknown:
sort of nuance to it in the data space, and being able to manage those translations back and forth. You know, I think what would be really cool to start to see is, like, you know, like CTFs. You know, what does a data privacy CTF look like? A capture the flag. You know, how could you use that to almost, like, you know, crowdsource certain types of techniques out there? Also, taking it a bit further, the question of CVEs, or what are some of the known vulnerabilities
[00:31:58] Unknown:
in datasets or particular data modeling approaches and how to mitigate them as you prepare your datasets for public consumption?
[00:32:07] Unknown:
Yeah. That's something we've talked about. Like, the CVE is one angle for sure. And, you know, a lot of us, especially the founders, spent a lot of time in the government, and in the government you have different levels of, like, data classification. Right? Confidential, secret, top secret, and there's all the different caveats you can attach on to those. What is that for data privacy? Right? And that's something we've been thinking about. Like, how do you put some type of, like, classification, which is a very overused term, so it can mean a lot of things. So when I say a classification, I mean, like, something, you know, not necessarily confidential or top secret, but how do you put a label, so to speak, on a dataset that tells you what, like, the reasonable shareability of it is. Right? And that's something we're trying to iterate on, because there's not a lot of that out there, if anything. Right? And so how do you just put a simple stamp on a dataset or a database that says, yep, this has been processed by Gretel, and it is certified that it could be released at this certain level of consumption with these guarantees.
[00:33:03] Unknown:
To that point of safety of release, there's also the time component to it, where maybe a certain dataset is only sensitive for a window of time, because eventually all of the information is no longer pertinent or sensitive. You know, for instance, if you have, like, a dataset related to travel plans, where maybe you're working with a travel agency and you don't want to release information about when people are going to be away from their homes, you know, in the future. But after a certain period, that's no longer valid, and so you don't have to worry about that aspect of it. I mean, there are certainly other considerations to that, of, you know, knowing what people's travel patterns are for different reasons. But the question of timeliness in data security is also maybe an interesting angle to it. Yeah. That's really interesting to think about how
[00:33:55] Unknown:
that data becomes less of a risk just with the factor of time. Right? In theory, like, you might not have to do anything. You just might say, hey, time's out. It's something that's in the past. I think, kind of related to that, you also have to think about how a dataset changes, especially when so many companies are working with streaming data. If even another dimension gets added into that stream, it could drastically change the privacy risk of it when you think about, like, you know, re-identification attacks. Right? You just add a new dimension that all of a sudden becomes, you know, re-attributable back to someone, because we just added in an age column or we just added in a city column.
You might have to completely rethink how you process that data from a safety perspective. And so how do you detect those changes over time? That's something that, like, our classification tool is really good for, and we've had a couple of customers kind of refer to it as, like, a data firewall, where it's like, hey, to use the security analogy, you might freak out if all of a sudden you see a port opened up that shouldn't be opened up. But what about when you see, like, a new data element or dimension get added that you didn't know was there? How does that change it? And so how can you say, hey, this slipped into the data stream, how do we change our privacy posture?
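A minimal sketch of that data firewall idea, assuming a hand-maintained allowlist of approved fields: compare each incoming record against the schema that was last reviewed and flag anything new before it flows downstream. The field names are invented, and a real classification step would do much more than a set difference, but the shape of the check is the same.

```python
# Fields that were approved the last time the pipeline's privacy posture was reviewed.
APPROVED_FIELDS = {"event_id", "timestamp", "product_id", "purchase_amount"}

def check_record(record: dict) -> set:
    """Return the set of unexpected fields in an incoming record."""
    return set(record) - APPROVED_FIELDS

event = {"event_id": "e1", "timestamp": "2022-04-01T12:00:00Z",
         "product_id": "p9", "purchase_amount": 19.99, "customer_age": 34}

unexpected = check_record(event)
if unexpected:
    # A new dimension (here, customer_age) slipped into the stream; hold it for
    # review, since it may change the re-identification risk of the dataset.
    print(f"New fields detected, hold for review: {sorted(unexpected)}")
```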
[00:35:05] Unknown:
Do you want to learn how the Joybird data team reduced their time spent building new integrations and managing data pipelines by 93%? Join a live webinar hosted by RudderStack on April 20th, where Joybird's director of analytics, Brett Chawney, will walk you through how retooling their data stack with RudderStack, Snowflake, and Iterable made this possible. Go to rudderstack.com/joybird to register today. Another interesting aspect to what you're saying about the classifications with the analogy to confidential or top secret or sensitive is maybe being able to use the lineage information that a lot of people are tracking in their data processing pipelines to understand what are the kind of terminal nodes in this lineage graph so that I know who are the consumers of this dataset and then being able to propagate that backwards to the earlier stages of the pipeline to say, okay. This piece of information is going to be consumed publicly.
This is the inflection point at which it starts along that path, and so this is where I need to know to apply these obfuscation techniques to these records to make sure that I don't have to worry about it further down the line. That's really interesting. It also kind of reminds me of, like, there's also, like, the joinability
[00:36:17] Unknown:
problem too, where for a dataset on its own, you might have a reasonable expectation that it's safe. But, you know, there's entire companies that build their entire business model on, like, aggregating data from different sources and then doing joins on it. Then all of a sudden, you've taken, you know, maybe three or four or five different datasets that are reasonably private to begin with, but the joining of them changes, quote, unquote, the classification level. Right? Again, that's another kind of corollary from government, where you have individual pieces that maybe are all unclassified, but you combine a couple of unclassified things together and now you have, like, a secret-level problem. So how you can break that joinability earlier is another kind of, like, challenging problem to think through. Absolutely.
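A tiny worked example of that joinability risk, with made-up data: each frame looks reasonably harmless on its own, but a join on shared quasi-identifiers links the sensitive attribute back to a name.

```python
import pandas as pd

# One dataset has no names, the other has no health information.
health = pd.DataFrame({
    "zip": ["19104", "02139"],
    "birth_year": [1985, 1990],
    "diagnosis": ["hypertension", "asthma"],
})
marketing = pd.DataFrame({
    "zip": ["19104", "02139"],
    "birth_year": [1985, 1990],
    "name": ["J. Myers", "A. Smith"],
})

# Joining on ZIP code + birth year re-links diagnoses to people.
reidentified = health.merge(marketing, on=["zip", "birth_year"])
print(reidentified)
```

Breaking that joinability earlier, by generalizing or synthesizing the quasi-identifier columns before the data leaves its source, is exactly the kind of intervention discussed above.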
[00:37:00] Unknown:
In your work of building Gretel and working in the space of privacy engineering and helping your customers mitigate some of the potential attacks on their datasets, what are some of the most interesting or innovative or unexpected ways that you've seen Gretel applied?
[00:37:17] Unknown:
I think kind of revisiting some of the stuff we mentioned before, some of the cold start problems have been really interesting, where you look at, let's say, a company that wanted to move into a different market from a sales perspective, and they wanted to kind of model, like, hey, what would the sales information look like in this market? It's almost like an extension of, like, transfer learning or what have you, where they're like, okay, well, can we take information we do have and then use that to synthesize what we think sales information in this other region would look like, and then use that to kind of, like, bootstrap a new effort? That has been one of the really interesting things for us. And, you know, I think the cold start problem was one we didn't maybe think was going to be, like, a really popular thing immediately. Right? We were always in the mindset that, like, oh, I just have dataset A and I wanna synthesize a version of it so it's safe to share. But that one has cropped up more and more. Like, you know, how could we be a replacement for something like, you know, there's, like, libraries like Faker out there, there's, like, tools like Mockaroo, which give you fake datasets, but they don't necessarily come with trends baked in. Right? Like, hey, I wanna know what, like, soccer ball sales in New England look like. What if we just had a synthetic model that was already trained on something like that and you could generate data from that? And then, I really wanna know what it looks like in the months at, like, you know, the end of the summer before school starts, because everyone's going back to training camps. And then, well, now I've already conditioned that model to generate data based off of, you know, a date, so I can just focus in on that. And I already know that it's trained on real data, so it's gonna be statistically relevant data. I don't have to just, like, use random number generators to create that. So that's been one of the coolest things I think that we have seen. In your own experience
[00:38:59] Unknown:
of building the company and working on the technology, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:39:07] Unknown:
I think one of the most challenging things is we're always on our toes about the different types of data we see. When you build something, you kinda build this, like, corpus of, I'll call it, like, fixtures, right, to use a very specific, like, testing term. And you go through and you test on, you know, maybe we have, like, 500 to 1,000 different datasets we test everything on, and we're like, oh, cool. Like, everything works. We think we have a variety of data. And then, you know, every day, someone's like, you know, I exported some HL7 data into, like, really gnarly nested JSON, and, like, I wanna process it. And we're like, oh, that's a new one. We have to kinda figure out how to work with that person to pre-process the data and put it through the system. And so I think we're always gonna be thinking about the shape of data and how to process different shapes of data. That's one of the biggest ones, like, at a granular level. And I think just the consumption and packaging is one that's always top of mind. Right? It's very hard to take a complex thing and make it simple to use. And so we're always trying to say, like, how do we go from, like, five keystrokes, to three keystrokes, to one keystroke? You know, long term, we wanna see Gretel in the hands of non-developers too. Right? Like, how do we plug into Google Sheets? How do we plug into Airtable? How do we plug in? So, just like, as simply as I can go into Google Sheets and share it with you, can I also route it through something like Gretel and get it to you? Right? And so now we're pushing up into a consumer stack, and I think that's something that's gonna be top of mind. It's a very complex problem, but I think a very fun problem to solve.
[00:40:34] Unknown:
For people who are interested in starting to incorporate privacy engineering and data obfuscation tactics into their workflows, what are the cases where Gretel is the wrong choice, and they're better suited either building something in house or using some open source framework or a different commercial vendor? I think there's a couple of places where Gretel is the wrong choice. One is on the ethical side. You know, if someone's, like, looking to exploit data and they're asking, like, hey, can I use Gretel to find
[00:41:00] Unknown:
ways to, like, violate privacy, almost? Like, I'm interested, like, oh, is this the dataset I want because it contains PII that I might use to, like, target customers in an incorrect way? You know, those are decisions we make as a company, not to, like, have them as customers. On an engineering level, reversibility, that's not something we're really super interested in. Right? When you synthesize a dataset and you create a synthetic model and you generate data from that model, and you're like, well, I need to actually revert this back to where it was, like, that's kind of a one-way door. Right? And if you're looking to do that, then, you know, of course, you could stay in Gretel and use some of our transforms on the data. Right? There are tools out there, like you see in legal documents, where, like, you redact certain parts, but only certain people have permissions to come in and look. That's not what we're really in the business of. Really, what we wanna do is, like, accelerate the way that you can share data and then use that data. So I would, you know, I'd probably recommend other tools when it comes to, like, selective unmasking of information.
[00:42:01] Unknown:
As you continue to build out the Gretel platform and explore the space of privacy engineering and help your customers accelerate that ability to build and share datasets, what are some of the things you have planned for the near to medium term, or any areas that you're particularly excited to dig into?
[00:42:19] Unknown:
Yeah. I think one of the areas that I'm pretty excited to dig into is how do we kind of put this collaboration ecosystem around data, you know, the same way GitHub did for Git. Right? When I think of GitHub, I don't think of Git per se. You know, it's, like, the prefix for the whole company's name. I think about, like, the ability to collaborate on code, like, through a rich pull request experience. I see, you know, a way to manage releases, a way to manage, you know, CI/CD now with GitHub Actions. So how do we kinda put that collaboration around data to make collaborating on data, like, fun and, like, actually, like, productive? And so for us, we kinda see this, call it a marketplace, call it a place to exchange models, exchange data, but the idea being, like, can I just sign up for Gretel, and someone has, you know, created a model that generates data that I want? Can I just join their project and generate data from that model? Right? That's what I'm really excited about building, is that ability to share data very quickly and, like, have fun doing it. And, you know, for us, we view Gretel as the processor of the data and then the facilitator to actually collaborate with that data. If I think about platforms like Kaggle, right, where it's, like, playing and experimenting and learning with machine learning, and then there's the Kaggle competitions that are there. You don't necessarily have guarantees about the data that's there, though. Right? We had a blog post a long time ago, we found some old data, it was, like, bank loan information, and there was PII, like, buried way deep in the dataset. Right? Like, there was no guarantee that the data that was there was safe to share. So, of course, they took it down once we kind of alerted them to it. But, you know, can we have a collaborative ecosystem like that where there's, like, a certain level of safety when you're there? Because, like, working with data is scary. Right? You're like, oh, someone emailed me this dataset, and, like, you open it, and you're, like, crossing your fingers that you don't see Social Security numbers staring at you or something like that. How do we remove that fear and just make it fun to work with and share data and, like, just lower that barrier?
[00:44:17] Unknown:
Yeah. It's definitely an interesting aspect to the problem. Yeah.
[00:44:22] Unknown:
Yeah. I mean, like, I have spent years working in the government with, like, sensitive information, and you're always just, like, holy crap, what's on my hard disk that I don't know about? Right? But, like, if I wanna work with, like, forecasting data or, like, technical data, telemetry data, if I download it from Gretel, like, I could have, like, peace of mind that I'm not downloading something I shouldn't have. Right? The worst case scenario working in the government was, like, the FBI is knocking at your door, like, hey, we have a record of you downloading something you shouldn't have downloaded. It's like, what's our version of that? It's like, you don't want, like, your risk officer kinda knocking on your door, like, hey, we exported a database dump that, like, went out to way too many people, and now we have to, like, take your laptop. So, like,
[00:44:59] Unknown:
that's kinda where I see us going. Yeah. That's definitely an interesting dimension to the problem, where most people are thinking about, you know, what is the risk of producing a dataset that has information that I don't want made available. But, yeah, there is the other side of it, of accidentally coming into possession of data that you shouldn't have, and what are the risks associated with that? Yeah. Yeah. I mean, it's the simplest things. Like, I have talked with, like, more junior engineers, not at our company, but, like, at places I've talked to. And I was just even talking to my brother, who is a developer,
[00:45:27] Unknown:
and talking about, like, access to datasets to learn other things. Right? Or, like, hey, how do I get into data analytics? Like, I really wanna be able to, like, get really good at SQL and stuff like that. It's like, what are good datasets for that? I'm like, there are a lot of good datasets, but you probably shouldn't have access to them. But it'd be cool if you could go download a dataset that's like, here's really good information for x, y, z. Go get it from Gretel, and it's safe. Yeah. I'd be curious to see how that plays out. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Well, I think, for one, there's a gap in the tooling, period, which I think is, you know, what we're really trying to fill in.
And two, I'd say, you know, I think there's a gap in, like, a good developer experience when using tools, especially for privacy engineering. They're not always the easiest to get up and running with. A while ago, I talked about that spectrum where, like, you have the big, like, data loss prevention type tools in the cloud providers, or you have kind of these, like, new and upcoming, like, open source tools that are out there. And sometimes you just wanna have, like, a really good developer experience. And, you know, I referenced companies like GitHub and Stripe, and they have excellent developer experiences. Like, you know, I've built with all of them. I've used APIs from some other providers out there, like WorkOS for authentication and Sentry for error handling. And, like, you just kinda read through their docs, and you're just working with them, and you just feel really productive when you're working with them. You're like, wow, I built something, like, that was really hard, but, like, this API made it really easy.
I think that developer experience is pretty much missing from privacy engineering. Right? And that's, like, you know, when it comes to building the product, that's one of the things I want to make sure of: other developers, like, just, like, have fun using Gretel, like, they feel productive. Absolutely.
[00:47:20] Unknown:
Well, thank you for taking the time today to join me and share the work that you're doing at Gretel and your overall perspective on the privacy engineering space. It's actually a very interesting, complicated, and important problem domain. So I appreciate the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. I had a blast, and, you know, I really appreciate you having me on the show. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to John Myers and Gretel AI
Founding Gretel AI: Challenges and Inspirations
Defining Privacy Engineering
Who Owns Privacy Engineering?
Target Users and Features of Gretel
Integrating Privacy Engineering in Data Workflows
Gretel's Platform Architecture
Utility vs Privacy: Balancing Data Needs
Synthetic Data in Machine Learning
Identifying and Mitigating Privacy Risks
Data Vulnerabilities and Security
Innovative Uses of Gretel
Challenges in Privacy Engineering
Future Plans for Gretel
Accidental Data Possession Risks