Summary
Every business collects data in some fashion, but sometimes the true value of the collected information only comes when it is combined with other data sources. Data trusts are a legal framework that allows businesses to pool their data collaboratively. This lets the members of the trust increase the value of their individual repositories and gain new insights that would otherwise require the substantial effort of duplicating the data owned by their peers. In this episode Tom Plagge and Greg Mundy explain how the BrightHive platform serves to establish and maintain data trusts, the technical and organizational challenges they face, and the outcomes that they have witnessed. If you are curious about data sharing strategies or data collaboratives, then listen now to learn more!
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Tom Plagge and Gregory Mundy about BrightHive, a platform for building data trusts
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what a data trust is?
- Why might an organization want to build one?
- What is BrightHive and what is its origin story?
- Beyond having a storage location with access controls, what are the components of a data trust that are necessary for them to be viable?
- What are some of the challenges that are common in establishing an agreement among organizations who are participating in a data trust?
- What are the responsibilities of each of the participants in a data trust?
- For an individual or organization who wants to participate in an existing trust, what is involved in gaining access?
- How does BrightHive support the process of building a data trust?
- How is ownership of derivative data sets/data products and associated intellectual property handled in the context of a trust?
- How is the technical architecture of BrightHive implemented and how has it evolved since it first started?
- What are some of the ways that you approach the challenge of data privacy in these sharing agreements?
- What are some legal and technical guards that you implement to encourage ethical uses of the data contained in a trust?
- What is the motivation for releasing the technical elements of BrightHive as open source?
- What are some of the most interesting, innovative, or inspirational ways that you have seen BrightHive used?
- Being a shared platform for empowering other organizations to collaborate I imagine there is a strong focus on long-term sustainability. How are you approaching that problem and what is the business model for BrightHive?
- What have you found to be the most interesting/unexpected/challenging aspects of building and growing the technical and business infrastructure of BrightHive?
- What do you have planned for the future of BrightHive?
Contact Info
- Tom
- Gregory
- gregmundy on GitHub
- @graygoree on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- BrightHive
- Data Science For Social Good
- Workforce Data Initiative
- NASA
- NOAA
- Data Trust
- Data Collaborative
- Public Benefit Corporation
- Terraform
- Airflow
- Dagster
- Secure Multi-Party Computation
- Public Key Encryption
- AWS Macie
- Blockchain
- Smart Contracts
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the project you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well.
Go to data engineering podcast dot com /linode, that's l I n o d e, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data Conference, and PyCon US.
Go to data engineering podcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today, I'm interviewing Tom Plagge and Greg Mundy about BrightHive, a platform for building data trusts. So, Tom, can you start by introducing yourself?
[00:01:48] Unknown:
Sure. Hi. My name is Tom Plagge. I came to BrightHive around the time it formed, actually. Myself and the CEO, Matt Gee, had been working together at the University of Chicago and then had been doing some consulting work. And so as we were creating BrightHive, I was working for another data science startup and was convinced by Matt, and by the vision that he put forward, to come over and lead the product and engineering team here. And, Greg, how about yourself? Sure. Hi. I'm Greg Mundy.
[00:02:19] Unknown:
I came to BrightHive after previously working as a government contractor for a local company here in West Virginia. For a few years, most of the work that I did was actually focused on data management. I got involved with BrightHive mainly through some of the work that they did out of the Workforce Data Initiative in Chicago. So it was really nice to come to a company where I could leverage a lot of the experience that I had in data management, building data pipelines and data systems, in a very, very
[00:02:56] Unknown:
interesting company. And, Tom, do you remember how you first got involved in the area of data management? Sure. So I came to Chicago, where I currently live,
[00:03:05] Unknown:
in about 2009, and I was actually a postdoc in astrophysics. So I was working with pretty decent volumes of data on the experiments that we were working on. I completed my postdoc and, afterwards, started looking at faculty jobs and decided that wasn't the route that I wanted to take. So I kind of took advantage of the community that existed at the University of Chicago around data science and the social sector. There was a center forming at the University of Chicago, as well as a fellowship program called Data Science for Social Good that was spinning up just as my postdoc was ending. And through discussions with the folks who were setting those new entities up, I realized that that was the route that I wanted to take. Of course, coming from an academic background, the skill sets and the technologies are a little bit different, but you do get a broad exposure to a lot of different ideas. So I had some learning to do as I switched over, both about the technology and about the needs of the organizations that we were working with, largely nonprofits and government agencies. But that's kind of how I got started. We started out trying to do pure data science and machine learning work, and from there realized that a lot of these social sector agencies actually had needs in data management and data engineering just as much as they did analytics, if not more so.
And so I started learning about the field
[00:04:35] Unknown:
and building my expertise there. And, Greg, do you remember how you first got involved in the area of data management? Yeah. I think
[00:04:41] Unknown:
I like to say I came into the field somewhat by accident. In grad school, I spent a lot of time doing what was called data mining at the time, so a lot of knowledge engineering and machine learning. When I got into industry (I actually spent some time as faculty for a few years first), I got into a company where we were doing a lot of work with data from NASA and NOAA. Then eventually I shifted, in that same company, over to a project that was doing large scale data archival for NOAA scientific datasets. After spending a few years in that sector, really developing the skills of managing these large scale data systems and being able to actually extract useful data from them in a timely fashion, it dawned on me that the social good of taking these large volumes of data and finding ways of doing something that benefits society with it was really the thing that got me even more interested in the field.
So when the opportunity to do work, through Data Science for Social Good, with the Workforce Data Initiative came along as a voluntary thing, I jumped on board, because it was a way of using the skills that I had developed from working on NASA and NOAA data to actually do more in the social sector.
[00:06:16] Unknown:
And so at the opening, I mentioned that BrightHive is a platform for building data trusts. Before we get too much into BrightHive specifically, I'm wondering if you can give a bit of a description of what a data trust is and some of the reasons that an organization might want to build one. Sure. A data trust is a form that a data collaborative can take, and it's a particular form that has a governance structure attached to it: a trustee, who is generally a party whom all the sides sharing data with each other
[00:06:48] Unknown:
uniformly have faith in as a steward of the data. It also has a technology backbone, so a data trust is really appropriate for multiparty data sharing. It's not really for internal collaboration within an enterprise; I think that problem has largely been addressed by other actors in the industry. It's when you start to go across agency lines, or across public and private lines, that you really need a lot of the infrastructure around data sharing agreements and very strict logging and controls over who's using the data that you are exposing, and for what purposes.
You need to draw a lot more lines and introduce a little bit of a different type of technological solution. And it's one that we recognized at BrightHive pretty early on as kind of a missing piece in the social sector in particular. A lot of data sharing agreements that are signed between governments and private industry or nonprofits are kind of one-offs, generally used for piloting or for developing one particular set of metrics for reporting, and then it goes away. And the next agency head who comes in kind of has to reinvent it and start over from scratch. So we realized early on, when we were working with these types of entities, that something more sustainable and extensible would really help the field move forward and help build a true social sector data infrastructure where people were collaborating to serve the people who needed their help. And so it was the match between this particular form of data collaborative, which actually grew out of the intelligence community, and what we saw as the needs of the social sector that really was the impetus behind BrightHive in the first place.
[00:08:48] Unknown:
And that brings us to BrightHive specifically. I'm wondering if you can talk a bit more about what BrightHive is both as an organization and from the platform perspective?
[00:08:58] Unknown:
As an organization, we are a public benefit corporation. We started out as a consultancy called Impact Lab, and that in turn formed out of the Data Science for Social Good fellowship that was hosted at the University of Chicago. That fellowship was designed to put largely grad students in computer science, engineering, the hard sciences, and the social sciences to work on machine learning problems that were facing governments and nonprofits. And certain types of problems are amenable to a three month fellowship program. Many are not.
So as the fellowship grew and expanded into its second and third year, we saw the need to take the kinds of problems that weren't amenable to that sort of drop-in, drop-out structure and put a little bit more sustained effort into them, so we created the consultancy, and Matt Gee and myself, as well as Andrew Means, were the principals. The three of us then spent about a year fleshing out what we saw as the needs of the sector, and then moved from a consultancy to a product company. That was when we created BrightHive and absorbed our old consultancy into it. So deep in our DNA is the idea of building data infrastructure with a social purpose. Right? That's how we got into this, that's how the three of us started working together, and it's right there in our company charter: data collaboration is good overall, and more people should do it, but we're particularly interested in solving the problems which address the needs of, let's say, low income individuals who are trying to get education and jobs. We've worked on problems related to homelessness and other, I'd say, broadly conceived
[00:10:44] Unknown:
social mission aligned work, and we definitely continue to do so. That's really who we are. And you mentioned that a data trust is a specific instance of what you categorize more broadly as a data collaborative. I'm wondering if you can talk a bit more about some of the other forms that might take. Yeah. I would say, you know, if you're thinking about multiparty data collaboratives,
[00:11:06] Unknown:
a lot of times the form that takes (and Western Pennsylvania is a great example of this) is one particular agency, or oftentimes an academic group, setting up a giant data warehouse and negotiating bilateral data sharing agreements with all the parties who agree to pool their data, and then they host the single data warehouse and control access to it. That's actually a really good model if you can get everybody to agree to it and if there's sustaining funding available; many things can go wrong, but certainly it has worked quite well in Western Pennsylvania. I would say the thing that makes a data trust a bit different is that, even though there is a trustee and a centralized governance function within a data trust, it's really intended to be multilateral rather than a set of bilateral agreements.
The idea is that you have n members of your data trust; let's say a department of higher education, a workforce board, maybe some large employers. Each of them exposes to the others the metadata about the datasets that they're willing to share. Each of them can ask for or publish particular datasets to other members, not necessarily all other members, and the members together can decide on metrics and calculations that can be performed on their administrative data to produce, let's say, aggregate metrics or anonymized datasets for use by analysts, which truly do pool their data, deduplicate, create new entities, and produce a data product that's really owned by the trust collectively.
So that's the distinction that we draw. We don't really have a centralized warehousing structure. We really have more of a peer to peer, API-based data sharing model, and a set of microservices that control access to it and log and manage all of the compliance requirements, which can sometimes be quite extensive. So I think that's what makes it different from a data collaborative of the sort that exists here in Chicago around the data from the Chicago Public School System. That's managed by Chapin Hall, and they are in charge of it, and everybody works through them. Our goal is not to make that obsolete, but to allow for situations where there isn't an entity that can serve that role to still have a way that they can practically,
[00:13:30] Unknown:
work together using data. And one of the things that you mentioned there that I'm interested in digging more into is this idea of the ownership of the derivative datasets, or aggregate information about the different entities contained within the data owned by the different members of the trust, and some of the complications that arise in terms of where the intellectual property would lie as far as any algorithms or derivative data products that come out of the
[00:13:57] Unknown:
information that's available in this trust? Yeah. Let me break that into two parts. First, the algorithms. Because of the space that we're operating in, we take a pretty opinionated stance that, especially if the public sector is involved, the algorithms themselves that are being used to create, let's say, metrics or scores or any sort of other derivative data products ought to really be made public if at all possible, even if the underlying data, for privacy or intellectual property reasons, is kept under lock and key. So our stance on this is, basically, publish your code. That's not going to work in every circumstance; there are obviously exceptions to every rule, but it's our conviction that that's the right path forward for social sector organizations, for philanthropy, and for public organizations.
But in terms of the data itself, I think you raise probably the single most important point. In a sense, the purpose of the data trust is to create, and have every party sign, a data trust agreement, which strictly lays those conditions out and also sets up a structure where each of the organizations sends in a representative and talks through things like: oh, we'd like to create this new combined dataset, or we'd like to publish this particular metric; it requires data from you, you, and you; let's all work together and figure out how to do it. And then the data product is owned by the trust, with the trustee as the custodian of it. It's a structure that, I think, is not, at least at this point, one that runs on autopilot. So BrightHive, in addition to having a product and engineering team, has a pretty strong services component that helps organizations think this all through and work through the data trust agreement, the legalities, and the compliance aspects, to get to the point where they're able to make those decisions collectively and implement them in code. Maybe one day, when everybody knows what a data trust is and there are a hundred templates to work from, it will be possible for a group of agencies to come together and build this type of collaborative without some handholding and guidance, but we're not there yet. And so each of the data trusts that we're working with has what we call a data trust lead assigned to it, whose job it is to help make those connections and to talk with both executive level folks as well as the IT teams, and kind of bring this to fruition. Because in addition to just the practicalities of how do I calculate this metric, or how do I implement this scoring algorithm, you have to get buy-in from everybody. You have to get everybody to sign off that the intellectual property is owned by this new thing, and that's a very nontrivial task.
And for an existing data trust that's already been established, have you found that there
[00:16:41] Unknown:
are general approaches to how an individual or an organization might gain access, either to be a member of the trust or to have some limited access to the data contained therein, to be able to do some sort of analysis or build additional products on top of it? Yeah. Absolutely. You know, one of the data trusts that we have (actually, a couple of them) came together explicitly in order to allow a third party software developer to implement an application that uses the data
[00:17:09] Unknown:
within the trust. And in that case, it's a matter of basically standing up an OAuth service, creating API keys, and making sure that that particular third party developer has access to the data, but also that every access is logged and that a full audit trail is kept. That's the scenario we designed our initial version of this around: the main user of the data within a data trust was to be third party app developers or third party analysts who would be accessing the data via API. I think what we've come to appreciate is that there are many scenarios in which it actually makes more sense for a big data user to just join the data trust formally. In that case, given the way that the governance structure is set up within the data trust agreement, there are pretty well thought through and explicit rules for how one becomes a part of the trust and how one's membership is approved. Now, if you're a member, that's great. It makes it easier, because you are able to participate in the governance and the decision making process around what data becomes available and how you can use it. But certainly we'll continue to build around this notion of third party access as well. And as long as you have RESTful APIs stood up and a good authentication and authorization system behind it, I think there's a lot of existing technology that can be brought to bear on solving the particular problem of access control. So going more into
[00:18:33] Unknown:
technical aspects of BrightHive, I'm wondering if you can talk through the existing architecture, how you manage the overall storage, access, and governance of these data trusts, and some of the ways that the architecture has evolved as you've had different use cases put forth and worked with different organizations, so that you're meeting all of their needs while still having a maintainable and sustainable architecture that you can build from? So at the very core,
[00:19:02] Unknown:
we decided very early on to use the microservices architecture style as the way of creating and evolving our software. The reason for that was that, very early on, we realized we want to use services like AWS, and we want to be able to use services like Google Cloud. But we also recognized that there were a lot of cases where users would come to us and say: well, okay, it's great that your infrastructure runs on these cloud providers, but we also want to have this software running on our infrastructure in house. That way, we have more control over it. So by leveraging the microservices architecture style, and using Docker containers exclusively to build most of our infrastructure, we've created a platform that scales very well but also supports those needs of being able to run in different kinds of environments.
For the typical data trust that we've developed, the ones that we are actively managing ourselves, most currently reside on Amazon Web Services. We use a lot of industry standard tools, such as Terraform for establishing the networking and the basic infrastructure needed to run the data trust. To manage the containers I mentioned that make up the various components of the data trust, we use services like Kubernetes for container orchestration and management. Internally, we use a lot of Python in house, but we also make use of other languages, including JavaScript and Golang, for some of the work that we do that's not necessarily core to the data trust. We rely a lot on open source technologies, so for our ETL we use Airflow and Dagster to help us build out and manage our ETL pipelines.
Overall, our current goal for 2020, now that we've developed this first round of what a data trust looks like, is to find ways to streamline a lot of these processes. We're getting better with the way we're making use of our microservices, and we're getting better at orchestrating them. We're also looking at ways of scaling this a little more effectively, like making it a more hands-off type of data trust where we can dynamically spin up a data trust on demand for an individual user who doesn't necessarily want a large data trust with a lot of elements to it. They might be in an exploratory stage where they're just trying to get their feet wet with a data trust.
So that's really what we're looking at for 2020 with respect to taking the technologies that we currently have, looking at the business needs that we've identified, and trying to build a more scalable product moving forward.
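As a rough illustration of the core idea behind the ETL tools Greg mentions (Airflow, Dagster), a pipeline is a DAG whose steps run in dependency order. Here is a toy sketch using only the Python standard library; it is illustrative only, not BrightHive's actual code:

```python
from graphlib import TopologicalSorter

def run_pipeline(steps):
    """Run ETL steps in dependency order, the way an Airflow or Dagster DAG would.

    steps maps a step name to a pair: (set of upstream step names, function
    that takes a dict of upstream results and returns this step's result).
    """
    deps = {name: upstream for name, (upstream, _) in steps.items()}
    results = {}
    # static_order() yields each step only after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        upstream, fn = steps[name]
        results[name] = fn({u: results[u] for u in upstream})
    return results

# A minimal extract -> transform -> load chain.
results = run_pipeline({
    "extract":   (set(),         lambda up: [1, 2, 3]),
    "transform": ({"extract"},   lambda up: [x * 10 for x in up["extract"]]),
    "load":      ({"transform"}, lambda up: sum(up["transform"])),
})
# results["load"] == 60
```

Real orchestrators add scheduling, retries, and logging on top of this ordering idea, which is why the hosted trusts lean on Airflow and Dagster rather than hand-rolled runners.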
[00:22:27] Unknown:
So let me give you just one example of what makes the microservice architecture so attractive for the particular types of problems that we're trying to solve. One of our data trusts is specifically set up to address (actually, many of them are generally addressing) the problem of talent pipeline management, or education to work. In that scenario, it's often the case that individuals who are trying to decide between different opportunities for higher education, whether it's vocational ed, a two year college degree, or a four year college degree, have a lot of alternatives available to them, but no data that helps them make the decision about which might have the best return on investment for their particular scenario. And return on investment data is something that we've been deeply involved in the thinking around for quite some time. Greg mentioned the Workforce Data Initiative; that was one of the early efforts that led to the creation of BrightHive in the first place.
As well, there are certain regulatory reasons why ROI data is becoming more and more important and relevant to education and workforce organizations. The issue is, though, that if you actually want to calculate ROI, you need access to individual level wage records in most cases. You need to know how much graduates of a particular program are earning, and whether they're working in the field that they've been trained in. And that sort of data is, as you might imagine, extremely sensitive. Generally, it's housed at state agencies or federal agencies, and the rules surrounding the use of that data are really quite strict, as you would hope they would be. What our early architectural decision to support both the big cloud providers and hosting certain components of the platform in house and on premises allowed us to do was to work with this particular client to deploy the actual metric calculation, the ROI calculation engine, locally on premises, and nevertheless to be able to take the outputs of it, ingest them into the data trust, and make those available to other members. The calculation engine itself, we as BrightHive don't even have the ability to query directly.
We can't find out how much you earned last year, which, again, is good. We shouldn't be able to. But the rules surrounding the aggregation, for example the limit on the smallest cohort for which you're allowed to calculate means and medians and other aggregate statistics, are all implemented in code and kept isolated on the on premises instance. Then we're able to communicate with it securely, get what we need, and incorporate the outputs into the data trust without actually having to be in the position of managing this extremely sensitive dataset
[00:25:30] Unknown:
ourselves on a cloud provider. Yeah. There's the issue of data that is privacy protected and covered by certain regulatory regimes for the owner of the data, where somebody who is partnering with them in a trust either doesn't have the controls in place or doesn't have the authorization to access it. And so I'm interested in how you approach some of those types of scenarios. For instance, if somebody's data is covered by HIPAA, and there's another member of the trust who's providing information about employment, and they want to be able to perform some sort of aggregates across those two datasets in either direction, how do you handle the aspect of the person with the employment records not having the controls
[00:26:18] Unknown:
and agreements in place for being able to access some of those HIPAA protected elements, and how do you ensure they're still able to gain some value from the trust? Yeah. That's a really important problem, obviously, and it's one that we're actively working on right now. So I don't want to give the impression that we've solved the secure multiparty computation problem entirely and are ready to deploy it for anybody who comes to us with their checkbook open. Just full transparency here: this is a set of features that we're actively working on, and we think we have a conceptual solution to it, apart from the scenario where there is a trusted entity, a trustee, who does have the ability to access both datasets, in which case the problem is much more straightforward. To handle the situation where there are two datasets that need to be combined to produce some metric, but there's no organization with the authorization or the trust relationships to access both, you either need some of the speculative or actively under research secure multiparty computation technologies, or you need to create a jointly governed entity, like a data trust, which can execute a pipeline (or, as we call it internally, a DAG) and be able to take encrypted copies of both datasets, decrypt them using keys to which neither party has access, perform the calculation, destroy the data, and return the results.
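The flow described here (take encrypted inputs, decrypt inside the trust's isolated environment, compute, destroy the raw data, return only the result) can be sketched roughly as follows. Note that the "encryption" below is a base64 placeholder so the example runs with the standard library alone; a real implementation would use genuine public key cryptography, and all of the names here are hypothetical:

```python
import base64

# WARNING: base64 is NOT encryption. It stands in for real public key
# encryption purely so this sketch is self-contained and runnable.
def encrypt(data: bytes, key: str) -> bytes:
    return base64.b64encode(key.encode() + b"|" + data)

def decrypt(blob: bytes, key: str) -> bytes:
    prefix, _, data = base64.b64decode(blob).partition(b"|")
    if prefix != key.encode():
        raise ValueError("wrong key")
    return data

def run_trust_dag(encrypted_inputs, trust_key, metric):
    """Decrypt inside the trust's compute, calculate, destroy, return the result.

    Neither contributing party holds trust_key; in the scenario described it
    would live only inside a short lived, jointly governed compute environment.
    """
    raw = {name: decrypt(blob, trust_key) for name, blob in encrypted_inputs.items()}
    result = metric(raw)
    raw.clear()  # the decrypted inputs never leave this function
    return result

# Each party encrypts its dataset to the trust before handing it over.
inputs = {
    "wages": encrypt(b"52000,61000,48000", "trust-key"),
    "graduates": encrypt(b"3", "trust-key"),
}
avg_wage = run_trust_dag(
    inputs, "trust-key",
    lambda raw: sum(map(int, raw["wages"].decode().split(","))) / int(raw["graduates"]),
)
# avg_wage is the mean wage; neither raw dataset is exposed outside the DAG
```

The point of the sketch is the shape of the protocol, not the cryptography: inputs arrive encrypted, only the jointly governed compute can decrypt them, and only the agreed output leaves.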
And we think that we have a pretty good idea around how to do that. It involves, you know, basic public-key encryption of both datasets that are the input, a well-defined input specification for what goes into a particular DAG as well as what comes out of it, and then, of course, pretty strict logging and authorization controls around when that DAG is allowed to be run, by whom, and keeping track of each run. So it's basically like you could imagine spinning up a compute environment that we, as BrightHive, just can't log into. It's just a Lambda function or a short-lived EC2 instance performing the calculations on data that's decrypted, and then re-encrypting the output for each of the parties who's allowed to access it, and then distributing that through the trust. That sort of approach, where the computation is happening on infrastructure that's owned by the trust, if you will, gives everybody involved a measure of control, and it allows not just the access to the data to be logged, but the access for what purpose to be logged. Like, oh, the data was accessed, but this particular routine was run on it, and you can look at it, the code is available, and then the data is destroyed afterwards and only the outputs are exposed. I think that's the kind of scenario that you could imagine working in an environment where everybody trusts each other, sort of, but has really strict compliance requirements that they have to live within, and really strict relationships and consent agreements with the people whose data is actually being affected, whom they are worried about from a number of different obvious perspectives: public relations and privacy breaches and everything else.
We have to reassure everybody involved that they have a high degree of control over what's happening, that all the i's are dotted and all the t's are crossed in terms of compliance, and that everything is, obviously, encrypted and handled appropriately according to best practices. And we think a trust is a good environment in which to meet all those requirements.
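The pattern described above, encrypting both inputs so that only an ephemeral, trust-controlled compute environment can decrypt them, running a pre-approved DAG with strict audit logging, destroying the plaintext, and returning only encrypted outputs, can be sketched roughly like this. This is an illustrative toy, not BrightHive's implementation: the SHA-256 keystream cipher stands in for real public-key encryption, and all of the names and data are invented.

```python
import hashlib
import json
import secrets
from datetime import datetime, timezone

def keystream_xor(key: bytes, data: bytes) -> bytes:
    # Toy symmetric cipher (SHA-256 counter keystream XOR) standing in for
    # real public-key encryption -- do NOT use this in production.
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(x ^ y for x, y in zip(data, out))

audit_log = []  # every DAG run is recorded: what ran, by whom, when

def run_trusted_dag(dag_name, enc_a, key_a, enc_b, key_b, output_key, run_by):
    """Decrypt both inputs inside the trust's ephemeral compute, run one
    pre-approved aggregate, log the run, and return only encrypted output."""
    a = json.loads(keystream_xor(key_a, enc_a))  # HIPAA-covered cohort ids
    b = json.loads(keystream_xor(key_b, enc_b))  # employment records
    # The only computation allowed: an aggregate, never row-level export.
    result = {"patients_employed": sum(1 for pid in a if b.get(pid) == "employed")}
    audit_log.append({"dag": dag_name, "run_by": run_by,
                      "at": datetime.now(timezone.utc).isoformat()})
    a = b = None  # "destroy" the decrypted data before returning
    return keystream_xor(output_key, json.dumps(result).encode())

# Each party encrypts its data; only the trust's compute environment,
# which neither BrightHive nor the members can log into, sees both keys.
key_a, key_b, key_out = (secrets.token_bytes(32) for _ in range(3))
enc_a = keystream_xor(key_a, json.dumps(["p1", "p2", "p3"]).encode())
enc_b = keystream_xor(key_b, json.dumps({"p1": "employed", "p2": "unemployed"}).encode())

enc_result = run_trusted_dag("employment_by_cohort", enc_a, key_a,
                             enc_b, key_b, key_out, run_by="trustee")
print(json.loads(keystream_xor(key_out, enc_result)))  # {'patients_employed': 1}
```

The point of the sketch is the shape of the flow, not the cryptography: inputs arrive encrypted, only an approved routine touches plaintext, the run is logged, and only an aggregate leaves.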
[00:29:47] Unknown:
And then somewhat tangential to the idea of privacy, and some of the issues of regulation around it, is the idea of ethical uses of the data, which is something that is very subjective and hard to define, specifically using technical guards. And I know that you have some approaches in the legal frameworks that you put forth as to how to identify some of the guidelines for making sure that you're using ethical practices in these data analyses. Another element of that is issues pertaining to bias that exists both in the source data and in the algorithms that are used to process it. And I'm wondering what your involvement is with the members of the trust to help them identify ethical best practices and ensure that the ways they're using the data, processing it, and trying to counteract bias in the source sets are up to current industry best practices.
[00:30:47] Unknown:
And any issues or advice that you have along those lines. Yeah. This is so important, and it's really, again, in our DNA to care a lot about this; it's actually largely driven our decisions around how we grow and who we work with. While the idea of the data trust broadly, and the software we've written, is, I think, appropriate in a lot of different contexts, we've limited our work so far, and the clients that we've taken on, largely to the education-to-work domain, which is one that we understand very well. We have a lot of folks on staff with deep subject matter expertise who have the ability to look at the particular problems that the data trusts are trying to solve, identify potential issues of bias or some other ethical issue that might arise, and actually bring their own expertise to bear on it, or at least be able to issue spot and get questions that might arise up to the governance committee to deal with. Because we've circumscribed the realm in which we're working with our early clients, that has given us a lot of comfort: we're working with these early clients very closely, we know what they're doing and why, and we have the in-house expertise to spot potential issues. As we grow as a company and move beyond the education-to-work domain, this is a harder problem to solve unless you want to basically staff up in every single domain that may exist. If we were to move into health care, for example, we would have to go and hire folks who have been working in health care IT or in actual health care practice long enough to be able to issue spot with the same sort of rigor that we're able to in the domain we currently work in.
And I think the goal is that, as we move on as a company to the place where we're working across a bunch of different domains, we wouldn't necessarily have to, but we probably would continue to offer that as part of our services offering in a lot of different scenarios.
So it's a combination of making sure that we're staffed to handle the trickiest bits. Like, if we start working with health data, having somebody in house, at least one person, who is able to weigh in on those issues and spot them. But then, also, as we're building out the software, being able to use tools like AWS Macie and others to bring some machine learning to bear on this as well, to at least flag potential issues. Like, oh, it looks like this is a gender field. Are you accounting for the fact that gender can change for some individuals, or that this shouldn't be stored as a binary? Or, oh hey, it looks like this is a Social Security number. Are you sure you want to publish this? Things like that can be flagged in an automated way.
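Automated flagging of this kind can start with simple heuristics long before machine learning enters the picture. A minimal sketch, with rules and field names that are ours for illustration rather than anything Macie or BrightHive actually ships:

```python
import re

# Matches the classic dashed SSN format, e.g. 123-45-6789.
SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def flag_columns(columns):
    """columns: dict of column name -> list of sample string values.
    Returns human-readable warnings for a reviewer, not hard blocks."""
    warnings = []
    for name, samples in columns.items():
        if any(SSN_RE.match(str(v)) for v in samples):
            warnings.append(f"{name}: looks like SSNs -- are you sure you want to publish this?")
        if "gender" in name.lower() and set(map(str.upper, map(str, samples))) <= {"M", "F"}:
            warnings.append(f"{name}: stored as a binary -- consider a more inclusive encoding.")
    return warnings

flags = flag_columns({
    "participant_gender": ["M", "F", "F"],
    "tax_id": ["123-45-6789"],
    "zip": ["60601"],
})
for w in flags:
    print(w)
```

Each warning is surfaced to the governance committee rather than acted on automatically, which matches the point made next: flagging assists review, it does not replace it.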
I don't think it's sufficient to just rely on automated flagging, which is part of why a governance structure exists in a data trust. You would hope that the decision to publish a particular data resource, if it's being reviewed by multiple parties who are contributing data to it, would highlight a lot of these issues through that review process. But given that the data trust idea is new to a lot of folks and the governance structures that we're setting up are still new, we do feel like it's incumbent upon us as a vendor to keep our own human eyes on a lot of what's happening, so that while we're in the process of automating some of these ethical controls, we have highly trained individuals who are helping guide us along that path. Yeah. And I'll also say that we have
[00:34:12] Unknown:
security consultants external to our company whom we work with a lot. So oftentimes, especially as we're making certain technical and architectural decisions, we do consult with our security team to make sure that those decisions are in line with the best interest of protecting the data of the users we're entrusted with.
[00:34:37] Unknown:
And then another element of the concept of the data trust is that the organizations that are coming into the trust and sharing their data obviously need to have some sort of technical capacity for maintaining their end of the system: storing the source datasets, securing them by their own means, and updating them at whatever frequency they need to make sure the data is fresh and valuable to every member of the trust. And I'm wondering what you have seen as far as some of the common needs on their end for being able to participate in the trust, or any challenges that they have in fulfilling the technical aspects of membership and ensuring that they have sufficient uptime and availability
[00:35:26] Unknown:
and accuracy in their source data. Yeah. There's so much variance, especially if you imagine working with nonprofits, foundations, and government agencies. You can just about imagine the huge variety of structures that exist in legacy database systems and internal technical staffing models. And the biggest challenge is always when you have three organizations who've never shared data with one another before: they each have different cultures around the way they share it, the way they manage it, and the way they keep it updated, and the culture shock that happens when you throw the three of them into the same room together is something that, again, we rely on humans to actively manage at this point. I would say that, in general, our assumption going into this work, which informed our initial thinking around the architecture itself, was that API access to the various data resources, and then automated ETL processes that posted updates to those APIs, was something that would work for, you know, not every data trust member, but a large subset of them. And as we've gotten into it, I think we've come to appreciate that there's actually another class of actors for whom that's just a mismatch with the way that they work with data and the way that they both report it and use it internally. They'd much rather start from the place of generating a report, uploading an Excel spreadsheet, dealing with flat files, rather than RESTful APIs.
And so we are adding capabilities to our platform that support those kinds of users as well, because, as you mentioned, the technical capacity issue is one that prevents a lot of data collaboratives from forming or from being sustainable in the first place. And you kind of have to meet the various members where they are. You can't necessarily create a data management and data engineering culture from scratch where none exists, just for the sake of a single collaborative project. You really have to build tooling that is appropriate for the audience that needs to use it. Greg, do you have anything to add to that? No. I think you covered it really well.
[00:37:32] Unknown:
Absolutely. As you said, there is a lot of variance, and as an organization we've actually spent a lot of time with some of our earlier data trusts helping them to understand the data that they have, in terms of what its value is, and also helping them build processes to actually extract data into the data trust. So that's been a really big part of the close work that we've been doing with a lot of these agencies lately.
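Supporting flat-file uploads alongside RESTful APIs, as described above, can be as simple as normalizing both paths into a single schema before anything enters the trust. A hedged sketch; the function and field names are invented for illustration:

```python
import csv
import io
import json

def rows_from_json(payload: str):
    # Path 1: a member's automated ETL posts JSON to the trust's API.
    return json.loads(payload)

def rows_from_csv(uploaded_file):
    # Path 2: a member uploads a CSV export from a spreadsheet or report.
    return list(csv.DictReader(uploaded_file))

def ingest(source_kind: str, raw):
    if source_kind == "api_json":
        rows = rows_from_json(raw)
    elif source_kind == "csv_upload":
        rows = rows_from_csv(raw)
    else:
        raise ValueError(f"unsupported source: {source_kind}")
    # Normalize keys so downstream pipelines see one schema either way.
    return [{k.strip().lower(): v for k, v in row.items()} for row in rows]

api_rows = ingest("api_json", '[{"ID": "1", "Status": "enrolled"}]')
file_rows = ingest("csv_upload", io.StringIO("ID,Status\n2,completed\n"))
print(api_rows + file_rows)
```

Both members end up contributing identically shaped rows, so the less technically staffed organization isn't forced to stand up an API just to participate.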
[00:38:08] Unknown:
Yeah. We've had customers come to us and say that the process of creating a data trust and working with BrightHive has actually caused them to take a step back and think more broadly about the culture of data sharing and data management across their various agencies, and actually try to make a deeper change in the way that they're operating. Even if it's not necessary just for the sake of the data trust, they feel like they've been able to appreciate all the different approaches that exist out there and, with that broader view, revisit the way that they're handling data internally. And that's really cool. In order to scale this company, we won't be able to work with every single customer at the level that we are with our early ones, but it's really rewarding to see those early customers make those connections and build on what we're doing in a way that spans all the work they do, whether it's collaborative or not.
[00:39:09] Unknown:
So true. You know, I'll say that one time, looking back at a particular customer of ours, we saw them go from a place where it was really difficult to get all the data sharing agreements in place, to get the players at the table to actually share data with us, and then saw that paradigm suddenly shift from "yeah, it's going to be difficult to get you this data" to "hey, can you add this new data that we have into the data trust?" That was huge for me. I was like, wow, they understood the value of the data trust and what it does for them.
[00:39:50] Unknown:
Another thing that I'm curious about is the life cycle of these trusts that you're working with, and whether they are intended to be short lived and only exist for the duration of a particular project, or whether they are generally set up as more of a long-term engagement with no set termination point. And as a corollary to that, I'm curious how you're approaching the sustainability of BrightHive itself, to ensure that for any of these data trusts that are designed to last in perpetuity, you're there for the lifetime of the project as a support mechanism and can make sure the data trust continues to be viable. To take the first question
[00:40:33] Unknown:
first, I think setting up the full governance structure and getting to the point where everybody can sign off on the data trust agreement is time consuming enough, and creates enough new entities, if you will, and new relationships, that it would probably be overkill for just a quick one-off pilot. If you were really just doing something that was strictly time-boxed, where you wanted to try a new approach to a particular problem or generate a particular dataset for a researcher to look at, you'd probably just sign a one-off data sharing agreement and do the thing, and that's that. The idea behind a data trust is that it is extensible and governable over the long run, and so sustainability is, I think, pretty fundamental to the model. The idea is that in a lot of these cases, agencies who are affecting the same population, or philanthropies who are serving populations that are also receiving services from other related groups, have a long-term need to be able to work together.
And providing the venue for them to do so over the long run is kind of what we're here for, as opposed to something that's more limited in time and scope. But that does raise the issue of vendor lock-in and the sustainability of BrightHive as a young company, and that has driven a lot of our strategy around open sourcing. For us, it's a very tactical decision. I'm not an open source fundamentalist, if you will, but I am a believer that it really does have very important uses for businesses, especially young businesses, because it provides comfort to the clients: if we were to disappear, or if a new competitor were to come along offering something better and more appropriate, well, they own the data. They have access to the core of the code, which would allow them to extract it or even keep the services running, and therefore it reduces the fear on their part that all of a sudden BrightHive disappears and their software is no longer supported. There is an alternative model where they either internally, or by working with one of our partner organizations or some other vendor, make the transition,
[00:42:45] Unknown:
with all the documented source code available to them, and that seems to have helped a lot. And in terms of the types of trusts that you've worked with and some of their outcomes, I'm curious what you have seen as the most interesting or innovative or inspirational ways that the BrightHive platform has been used, as well as this broader concept of a data trust being leveraged.
[00:43:07] Unknown:
I can give one really good example from one of our data trusts. So the data trust that I'm going to talk about does a lot with workforce development: a lot of services that they offer to individuals who might be displaced employees, who may be veterans, or who might just be looking for ways to improve themselves. So you have this group of agencies that are doing a lot of really good and valuable work, but they don't actually have much insight into just how effective that work is. One of the eye-opening moments for me came after we'd gotten the data trust set up and data started to flow. We had our go-live, they started to get data into the data trust, analysis was being done, and data was being used in third-party dashboards and applications.
And I'm sitting in a meeting, and one of the individuals who is in charge looks at us and says, "Wow, for the first time, we're actually able to see trends in things." Just, you know, things that might not be very interesting to the wider population, but here we're seeing this trend where individuals are coming into our service centers between these two hours, or we're seeing more applications for services at two in the morning, for instance. So just the mere fact that they're suddenly unlocking the power of their data to gain insight they wouldn't otherwise have been able to gain, it made all the difference in the world for me, and it really was the thing that cemented the reason why I do what I do at BrightHive. Tom, do you have anything to add? I think one of the most fun things about working here is
[00:45:06] Unknown:
actually going to these governance meetings and seeing the folks who are contributing data to the trust and consuming data from it have these really generative moments of, "Oh, hey, we have these data in the trust. We could do these thousand other things," and to see the wheels start turning. Right? We've been doing this now with most of our customers for months, not years. But to see those light bulbs come on, and to hear people talk about the ways that they want to be able to use this trust, and the data that they're putting into it, three years from now to do something totally transformative.
The fact that people are approaching data collaboratives with that sort of spirit of generating new ideas, and generating innovation in the way that services are delivered, is striking; it's a weird venue for it in some ways. Right? Usually, data should be supporting that kind of innovation, but you wouldn't necessarily think it would be driving it. And yet being able to actually look at outcomes, being able to look at the way that services are being delivered to individuals from a number of different directions, seems to open people's eyes and make them step back and think about the bigger picture in a way that's really conducive to making new connections and coming up with new ideas.
I have some examples that I could refer to that have happened already, but what I'm really excited about is this: as these trust relationships between the members of our data trusts solidify, and as these new ideas keep coming up, I think we'll actually see some pretty meaningful transformations in the way that services are delivered to individuals who are just entering the job market for the first time. Agencies who have just never worked together seriously in the past will start to collaborate at an individual level, track their referrals, whether they're working or not, and understand the various barriers to employment in a truly robust sense, where it's not just "oh, this person needs a college degree," but maybe also "this person doesn't have a car and needs a way to get to work," or "this person is struggling with a mental health problem." To take that sort of holistic approach to case management, and to see an individual whom a case manager may have been working with for two years but never really understood the core of what was keeping them from taking the steps they wanted to take.
That's really cool. It's, I think, potentially transformative. So I'm looking forward to seeing that materialize over the coming months and years.
[00:47:43] Unknown:
And as we look forward into the future of BrightHive, and the future of how data cooperatives and data trusts are being used as data continues to be central to almost everything we do, I'm wondering what you have planned and what trends you foresee as we move forward. I think one of the big trends that we're going to see is one that we're already preparing for, which is the expansion of these data collaboratives beyond
[00:48:10] Unknown:
particular subject matters, let's say, or silos. Education-to-work obviously already includes a number of different stakeholders: it includes public and private education providers, software boot camps, employers, K-12. It's already a pretty large community, but as I mentioned before, we understand that barriers to employment go way, way beyond that. And there's so much curiosity and openness now to this notion of whole-person care, if you will, which has emerged out of research and discussions over the past decade or so. Seeing that come to fruition, seeing agencies and philanthropies who are working on seemingly disparate subject matters make connections between them, and being able to make better decisions and steer people in better directions based on something that's happening in another silo, or another part of the state government altogether, that they just never thought about much before, is, I think, something that's going to happen. It's coming from the top down as well as the bottom up: folks who are working within these agencies are getting curious about it and pushing their bosses to take that next step, and political appointees and agency heads are coming in and saying, hey, we need an actual data strategy.
And I think those two forces pushing at the problem from either end are leading to the expansion of people's idea of what's possible in terms of collaborating with data. That's thing one. And I would say thing two is this notion of sustainable data governance. To me, this is the biggest problem that is yet to really be solved with technology in the realm of data engineering. Within a single enterprise, you can build, you know, a data bus, a set of tools for collaborating around data. But once you go beyond a single chain of command and involve a bunch of different stakeholders who are coming at it on basically a level playing field, where everybody needs to agree and sign off on a decision before it's made, I don't think there's great technological support for those types of organizations right now, or for the ongoing support, governance, and maintenance of those relationships once they're established for a given purpose.
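The governance constraint described here, decisions on a level playing field where every stakeholder must sign off before anything proceeds, is easy to state even if the tooling around it is immature. A toy model of such a unanimous-consent gate; the member names and the API are purely illustrative, not an existing product:

```python
class GovernanceGate:
    """Unanimity, not majority: a proposed change to a shared data
    resource proceeds only once every voting member has approved it."""

    def __init__(self, members):
        self.members = set(members)
        self.approvals = set()

    def approve(self, member):
        if member not in self.members:
            raise ValueError(f"{member} is not a voting member")
        self.approvals.add(member)

    def is_approved(self):
        return self.approvals == self.members

gate = GovernanceGate(["state_agency", "community_college", "nonprofit"])
gate.approve("state_agency")
gate.approve("community_college")
print(gate.is_approved())  # False: the nonprofit has not signed off yet
gate.approve("nonprofit")
print(gate.is_approved())  # True: now the change may proceed
```

A real system would add proposal payloads, expirations, and an audit trail, but the core difference from single-enterprise tooling is exactly this: no single decision maker can force the gate open.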
So I see, first of all, the demand for this type of arrangement continuing to grow. And then, second of all, the technical
[00:50:42] Unknown:
space for software tooling to solve that real problem that's preventing those relationships from being sustainable and really productive. Are there any other aspects of the work that you're doing at BrightHive, or the concept of data trusts and data collaboratives, or any of the engineering efforts that you're involved with, that we didn't discuss yet that you'd like to cover before we close out the show? I think
[00:51:06] Unknown:
no, I think we did. We pretty much covered the bulk of what's on the BrightHive horizon, so I feel pretty good. I will say, though, that one thing that was sort of an eye opener for me over maybe the last six months is just how much more aware people are becoming of privacy, and especially the privacy of their own data. And it's not just the sort of data elite who are thinking in those terms. So as society in general starts to think more carefully about the data that it's actually sharing, I see this concept of a data trust becoming even more widespread than it currently is.
[00:51:53] Unknown:
Yeah. Absolutely. And driven in part by regulations too, right? The CCPA, the GDPR, all those four-letter acronyms. I don't think that's the end of the story around data privacy regulation; I think it's just the beginning. So not only are consumers and citizens starting to ask questions about how their data is being used, but their representatives are starting to set some pretty clear boundaries around it. It's not entirely clear to me that a company like Palantir is set up in a way that's compatible with some of those controls, whereas I think a data trust really can be. So I think the regulatory environment is another thing to keep an eye on, as well as citizen expectations.
[00:52:35] Unknown:
Alright. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today, and I'll start with you, Tom. Well, like I said, I think the biggest
[00:52:55] Unknown:
gap is in the glue. So many of the technical aspects of sharing data actually have a really robust ecosystem around them. The piece that's missing is being able to bring all of it together and make it sustainable and manageable by the data custodians themselves, as opposed to relying on everybody collectively going and signing up with a single vendor and having a uniform IT environment across the entire ecosystem. That's largely impractical, especially in something like the social sector where there's a whole bunch of different actors.
So to me, the glue that connects data infrastructure provided by multiple vendors, and the governance structure that makes it sustainable, is the biggest missing piece right now. If you're a single enterprise with a single decision maker at the top who can say "use this vendor's software," I think you can solve a lot of your data management problems that way. Or if you have a large enough IT staff, you can take the open source tools that are out there, glue them together, and solve the problem that way. But once you start introducing multiple stakeholders into the equation, I think the technical model and the governance model are not standardized yet; they haven't settled on anything.
[00:54:26] Unknown:
I tend to agree with Tom as well. Honestly, in terms of our technology, I feel we definitely understand a lot about building and sharing data just, you know, out of practice. Not just BrightHive, but as an industry, we've been doing data management for a long time. But, again, I think that automating the legal and governance parts of data trust management is really something that's going to be very important moving forward. As we look at things like blockchain and what it tries to bring to the fore, how do we take some of this knowledge of, say, things like smart contracts, and apply it to things like data sharing agreements within a data trust? So I'm really excited not only to see how things like smart contracts will help to shape our thinking, but also to be leading part of the work where we're actually thinking about this stuff as it applies not only
[00:55:32] Unknown:
to BrightHive, but also to data sharing and data trusts in general. Alright. Well, thank you both very much for taking the time today to join me and share the work that you're doing at BrightHive. It's definitely a very interesting area of the overall work being done with data management, and I'm excited to see some of the future work that you do and some of the outcomes of the data trusts that you're helping to build. So thank you both for all of your efforts, and I hope you enjoy the rest of your day. Thank you, Tobias. You too. Yep. Thanks for having us. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to BrightHive and Guests
Understanding Data Trusts
BrightHive's Mission and Structure
Data Ownership and Intellectual Property
Technical Architecture of BrightHive
Data Privacy and Secure Computation
Ethical Data Use and Bias
Technical Capacity and Challenges for Trust Members
Lifecycle and Sustainability of Data Trusts
Inspirational Use Cases and Outcomes
Future Trends in Data Trusts and BrightHive's Vision
Closing Remarks and Final Thoughts