Summary
Every business collects data in some fashion, but sometimes the true value of the collected information only comes when it is combined with other data sources. Data trusts are a legal framework that allows businesses to pool their data collaboratively. This lets the members of the trust increase the value of their individual repositories and gain new insights that would otherwise require the substantial effort of duplicating the data owned by their peers. In this episode Tom Plagge and Greg Mundy explain how the BrightHive platform serves to establish and maintain data trusts, the technical and organizational challenges they face, and the outcomes that they have witnessed. If you are curious about data sharing strategies or data collaboratives, then listen now to learn more!
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Tom Plagge and Gregory Mundy about BrightHive, a platform for building data trusts
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what a data trust is?
- Why might an organization want to build one?
- What is BrightHive and what is its origin story?
- Beyond having a storage location with access controls, what are the components of a data trust that are necessary for them to be viable?
- What are some of the challenges that are common in establishing an agreement among organizations who are participating in a data trust?
- What are the responsibilities of each of the participants in a data trust?
- For an individual or organization who wants to participate in an existing trust, what is involved in gaining access?
- How does BrightHive support the process of building a data trust?
- How is ownership of derivative data sets/data products and associated intellectual property handled in the context of a trust?
- How is the technical architecture of BrightHive implemented and how has it evolved since it first started?
- What are some of the ways that you approach the challenge of data privacy in these sharing agreements?
- What are some legal and technical guards that you implement to encourage ethical uses of the data contained in a trust?
- What is the motivation for releasing the technical elements of BrightHive as open source?
- What are some of the most interesting, innovative, or inspirational ways that you have seen BrightHive used?
- Being a shared platform for empowering other organizations to collaborate I imagine there is a strong focus on long-term sustainability. How are you approaching that problem and what is the business model for BrightHive?
- What have you found to be the most interesting/unexpected/challenging aspects of building and growing the technical and business infrastructure of BrightHive?
- What do you have planned for the future of BrightHive?
Contact Info
- Tom
- Gregory
- gregmundy on GitHub
- @graygoree on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- BrightHive
- Data Science For Social Good
- Workforce Data Initiative
- NASA
- NOAA
- Data Trust
- Data Collaborative
- Public Benefit Corporation
- Terraform
- Airflow
- Dagster
- Secure Multi-Party Computation
- Public Key Encryption
- AWS Macie
- Blockchain
- Smart Contracts
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the project you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well.
Go to data engineering podcast dot com /linode, that's l I n o d e, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data Conference, and PyCon US.
Go to data engineering podcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today, I'm interviewing Tom Plagge and Greg Mundy about BrightHive, a platform for building data trusts. So, Tom, can you start by introducing yourself?
[00:01:48] Unknown:
Sure. Hi. My name is Tom Plagge. I came to BrightHive around the time it formed, actually. Myself and the CEO, Matt Gee, had been working together at the University of Chicago and then had been doing some consulting work. And so as we were creating BrightHive, I was working for another data science startup and was convinced by Matt, and by the vision that he put forward, to come over and lead the product and engineering team here. And, Greg, how about yourself? Sure. Hi. I'm Greg Mundy.
[00:02:19] Unknown:
I came to BrightHive after previously working as a government contractor for a local company here in West Virginia. For a few years, most of the work that I did was actually focused on data management. I got involved with BrightHive mainly through some of the work that they did out of the Workforce Data Initiative in Chicago. So it was really nice to come to a company where I could leverage a lot of the experience that I had in data management, building data pipelines and data systems, in a very, very
[00:02:56] Unknown:
interesting company. And, Tom, do you remember how you first got involved in the area of data management? Sure. So I came to Chicago, where I currently live,
[00:03:05] Unknown:
in about 2009, and I was actually a postdoc in astrophysics. So I was working with pretty decent volumes of data on the experiments that we were working on. I completed my postdoc and, afterwards, started looking at faculty jobs and decided that wasn't the route that I wanted to take. So I kind of took advantage of the community that existed at the University of Chicago around data science and the social sector. There was a center forming at the University of Chicago, as well as a fellowship program called Data Science for Social Good that was spinning up just as my postdoc was ending. And through discussions with the folks who were setting those new entities up, I realized that that was the route that I wanted to take. Of course, coming from an academic background, the skill sets and the technologies are a little bit different, but you do get a broad exposure to a lot of different ideas. So I had some learning to do as I switched over, both about the technology and about the needs of the organizations that we were working with, largely nonprofits and government agencies. But that's kind of how I got started. We started out trying to do pure data science and machine learning work, and from there realized that a lot of these social sector agencies actually had needs in data management and data engineering just as much as they did analytics, if not more so.
And so I started learning about the field
[00:04:35] Unknown:
and building my expertise there. And, Greg, do you remember how you first got involved in the area of data management? Yeah. I think
[00:04:41] Unknown:
I like to say I came into the field somewhat by accident. In grad school, I spent a lot of time doing what was called data mining at the time, so a lot of knowledge engineering and machine learning. When I got into industry (I actually spent some time as faculty for a few years first), I got into a company where we were doing a lot of work with data from NASA and NOAA. Then eventually I shifted, in that same company, over to a project that was doing large scale data archival for NOAA scientific datasets. After spending a few years in that sector, really developing the skills of managing these large scale data systems and being able to actually extract useful data from them in a timely fashion, it dawned on me that the social good of taking these large volumes of data and finding ways of doing something that benefits society with it was really the thing that got me even more interested in the field.
So when the opportunity to do work, through Data Science for Social Good, with the Workforce Data Initiative came along as a voluntary thing, I jumped on board, because it was a way of using the skills that I had developed from working on NASA and NOAA data to actually do more in the social sector.
[00:06:16] Unknown:
And so at the opening, I mentioned that BrightHive is a platform for building data trusts. Before we get too much into BrightHive specifically, I'm wondering if you can give a bit of a description of what a data trust is and some of the reasons that an organization might want to build one. Sure. A data trust is a form that a data collaborative can take, and it's a particular form that has a governance structure attached to it: a trustee, who is generally a party whom all the sides sharing data with each other
[00:06:48] Unknown:
uniformly have faith in as a steward of the data. It also has a technology backbone, so a data trust is really appropriate for multiparty data sharing. It's not really for internal collaboration within an enterprise; I think that problem has largely been addressed by other actors in the industry. It's when you start to go across agency lines, or across public and private lines, that you really need a lot of the infrastructure around data sharing agreements and very strict logging and controls over who's using the data that you are exposing, and for what purposes.
You need to draw a lot more lines and introduce a little bit of a different type of technological solution. And it's one that we recognized at BrightHive pretty early on as kind of a missing piece in the social sector in particular. A lot of data sharing agreements that are signed between governments and private industry or nonprofits are kind of one-offs, generally used for piloting or for developing one particular set of metrics for reporting, and then it goes away. And the next agency head who comes in kind of has to reinvent it and start over from scratch. So we realized early on, when we were working with these types of entities, that something more sustainable and extensible would really help the field move forward and help build a true social sector data infrastructure where people were collaborating to serve the people who needed their help. And so it was the match between this particular form of data collaborative, which actually grew out of the intelligence community, and what we saw as the needs of the social sector that really was the impetus behind BrightHive in the first place.
[00:08:48] Unknown:
And that brings us to BrightHive specifically. I'm wondering if you can talk a bit more about what BrightHive is both as an organization and from the platform perspective?
[00:08:58] Unknown:
As an organization, we are a public benefit corporation. We started out as a consultancy called Impact Lab, and that in turn formed out of the Data Science for Social Good fellowship that was hosted at the University of Chicago. That fellowship was designed to put largely grad students in computer science, engineering, the hard sciences, and the social sciences to work on machine learning problems that were facing governments and nonprofits. And certain types of problems are amenable to a three month fellowship program. Many are not.
So as the fellowship grew and expanded into its second and third year, we saw the need to take the kinds of problems that weren't amenable to that sort of drop-in, drop-out structure and put a little bit more sustained effort into them, so we created the consultancy, and Matt Gee and myself, as well as Andrew Means, were the principals. The three of us then spent about a year fleshing out what we saw as the needs of the sector, and then moved from a consultancy to a product company. That was when we created BrightHive and absorbed our old consultancy into it. So deep in our DNA is the idea of building data infrastructure with a social purpose. Right? That's how we got into this, that's how the three of us started working together, and it's right there in our company charter: data collaboration is good overall, and more people should do it, but we're particularly interested in solving the problems which address the needs of, let's say, low income individuals who are trying to get education and jobs. We've worked on problems related to homelessness and other, I'd say, broadly conceived
[00:10:44] Unknown:
social mission aligned work, and we definitely continue to do so. That's really who we are. And you mentioned that a data trust is a specific instance of what you categorize more broadly as a data collaborative. I'm wondering if you can talk a bit more about some of the other forms that might take. Yeah. I would say, you know, if you're thinking about multiparty data collaboratives,
[00:11:06] Unknown:
a lot of times the form that takes (and Western Pennsylvania is a great example of this) is one particular agency, or oftentimes an academic group, setting up a giant data warehouse and negotiating bilateral data sharing agreements with all the parties who agree to pool their data, and then they host the single data warehouse and control access to it. That's actually a really good model if you can get everybody to agree to it and if there's sustaining funding available; many things can go wrong, but certainly it has worked quite well in Western Pennsylvania. I would say the thing that makes a data trust a bit different is that, even though there is a trustee and a centralized governance function within a data trust, it's really intended to be multilateral rather than a set of bilateral agreements.
The idea is that you have n members of your data trust; let's say a department of higher education, a workforce board, maybe some large employers. Each of them exposes to the others the metadata about the datasets that they're willing to share. Each of them can ask for or publish particular datasets to other members, not necessarily all other members, and the members together can decide on metrics and calculations that can be performed on their administrative data to produce, let's say, aggregate metrics or anonymized datasets for use by analysts, which truly do pool their data, deduplicate, create new entities, and produce a data product that's really owned by the trust collectively.
So that's the distinction that we draw. We don't really have a centralized warehousing structure. We really have more of a peer to peer, API-based data sharing model, and a set of microservices that control access to it and log and manage all of the compliance requirements, which can sometimes be quite extensive. So I think that's what makes it different from a data collaborative of the sort that exists here in Chicago around the data from the Chicago Public School System. That's managed by Chapin Hall, and they are in charge of it, and everybody works through them. Our goal is not to make that obsolete, but to allow for situations where there isn't an entity that can serve that role to still have a way that they can practically,
[00:13:30] Unknown:
work together using data. And one of the things that you mentioned there that I'm interested in digging more into is this idea of the ownership of the derivative datasets, or aggregate information about the different entities contained within the data owned by the different members of the trust, and some of the complications that arise in terms of where the intellectual property would lie as far as any algorithms or derivative data products that come out of the
[00:13:57] Unknown:
information that's available in this trust? Yeah. Let me break that into two parts. First, the algorithms. Because of the space that we're operating in, we take a pretty opinionated stance that, especially if the public sector is involved, the algorithms themselves that are being used to create, let's say, metrics or scores or any sort of other derivative data products ought to really be made public if at all possible, even if the underlying data, for privacy or intellectual property reasons, is kept under lock and key. So our stance on this is, basically, publish your code. That's not going to work in every circumstance; there are obviously exceptions to every rule, but it's our conviction that that's the right path forward for social sector organizations, for philanthropy, and for public organizations.
But in terms of the data itself, I think you raise probably the single most important point. In a sense, the purpose of the data trust is to create, and have every party sign, a data trust agreement, which strictly lays those conditions out and also sets up a structure where each of the organizations sends in a representative and talks through things like: oh, we'd like to create this new combined dataset, or we'd like to publish this particular metric; it requires data from you, you, and you; let's all work together and figure out how to do it. And then the data product is owned by the trust, with the trustee as the custodian of it. It's a structure that, I think, is not, at least at this point, one that runs on autopilot. So BrightHive, in addition to having a product and engineering team, has a pretty strong services component that helps organizations think this all through and work through the data trust agreement, the legalities, and the compliance aspects, to get to the point where they're able to make those decisions collectively and implement them in code. Maybe one day, when everybody knows what a data trust is and there are a hundred templates to work from, it will be possible for a group of agencies to come together and build this type of collaborative without some handholding and guidance, but we're not there yet. And so each of the data trusts that we're working with has what we call a data trust lead assigned to it, whose job it is to help make those connections and to talk with both executive level folks as well as the IT teams, and kind of bring this to fruition. Because in addition to just the practicalities of how do I calculate this metric, or how do I implement this scoring algorithm, you have to get buy-in from everybody. You have to get everybody to sign off that the intellectual property is owned by this new thing, and that's a very nontrivial task.
And for an existing data trust that's already been established, have you found that there
[00:16:41] Unknown:
are general approaches to how an individual or an organization might gain access, either to be a member of the trust or to have some limited access to the data contained therein, to be able to do some sort of analysis or build additional products on top of it? Yeah. Absolutely. You know, one of the data trusts that we have (actually, a couple of them) came together explicitly in order to allow a third party software developer to implement an application that uses the data
[00:17:09] Unknown:
within the trust. And in that case, it's a matter of basically standing up an OAuth service, creating API keys, and making sure that that particular third party developer has access to the data, but also that every access is logged and that a full audit trail is kept. That's the scenario we designed our initial version of this around: the main user of the data within a data trust was to be third party app developers or third party analysts who would be accessing the data via API. I think what we've come to appreciate is that there are many scenarios in which it actually makes more sense for a big data user to just join the data trust formally. In that case, given the way that the governance structure is set up within the data trust agreement, there are pretty well thought through and explicit rules for how one becomes a part of the trust and how one's membership is approved. Now, if you're a member, that's great. It makes it easier, because you are able to participate in the governance and the decision making process around what data becomes available and how you can use it. But certainly we'll continue to build around this notion of third party access as well. And as long as you have RESTful APIs stood up and a good authentication and authorization system behind it, I think there's a lot of existing technology that can be brought to bear on solving the particular problem of access control. So going more into
[00:18:33] Unknown:
technical aspects of BrightHive, I'm wondering if you can talk through the existing architecture, how you manage the overall storage, access, and governance of these data trusts, and some of the ways that the architecture has evolved as you've had different use cases put forth and worked with different organizations, so that you're meeting all of their needs while still having a maintainable and sustainable architecture that you can build from? So at the very core,
[00:19:02] Unknown:
we decided very early on to use the microservices architecture style as the way of creating and evolving our software. The reason for that was that, very early on, we realized we want to use services like AWS, and we want to be able to use services like Google Cloud. But we also recognized that there were a lot of cases where users would come to us and say: well, okay, it's great that your infrastructure runs on these cloud providers, but we also want to have this software running on our infrastructure in house. That way, we have more control over it. So by leveraging the microservices architecture style, and using Docker containers exclusively to build most of our infrastructure, we've created a platform that scales very well but also supports those needs of being able to run in different kinds of environments.
For the typical data trust that we've developed, the ones that we are actively managing ourselves, most currently reside on Amazon Web Services. We use a lot of industry standard tools, such as Terraform for establishing the networking and the basic infrastructure needed to run the data trust. To manage the containers I mentioned that make up the various components of the data trust, we use services like Kubernetes for container orchestration and management. Internally, we use a lot of Python in house, but we also make use of other languages, including JavaScript and Golang, for some of the work that we do that's not necessarily core to the data trust. We rely a lot on open source technologies, so for our ETL we use Airflow and Dagster to help us build out and manage our ETL pipelines.
Overall, our current goal for 2020, now that we've developed this first round of what a data trust looks like, is to find ways to streamline a lot of these processes. We're getting better with the way we're making use of our microservices, and we're getting better at orchestrating them. We're also looking at ways of scaling this a little more effectively, like making it a more hands-off type of data trust where we can dynamically spin up a data trust on demand for an individual user who doesn't necessarily want a large data trust with a lot of elements to it. They might be in an exploratory stage where they're just trying to get their feet wet with a data trust.
So that's really what we're looking at for 2020 with respect to taking the technologies that we currently have, looking at the business needs that we've identified, and trying to build a more scalable product moving forward.
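As a rough illustration of the core idea behind the ETL tools Greg mentions (Airflow, Dagster), a pipeline is a DAG whose steps run in dependency order. Here is a toy sketch using only the Python standard library; it is illustrative only, not BrightHive's actual code:

```python
from graphlib import TopologicalSorter

def run_pipeline(steps):
    """Run ETL steps in dependency order, the way an Airflow or Dagster DAG would.

    steps maps a step name to a pair: (set of upstream step names, function
    that takes a dict of upstream results and returns this step's result).
    """
    deps = {name: upstream for name, (upstream, _) in steps.items()}
    results = {}
    # static_order() yields each step only after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        upstream, fn = steps[name]
        results[name] = fn({u: results[u] for u in upstream})
    return results

# A minimal extract -> transform -> load chain.
results = run_pipeline({
    "extract":   (set(),         lambda up: [1, 2, 3]),
    "transform": ({"extract"},   lambda up: [x * 10 for x in up["extract"]]),
    "load":      ({"transform"}, lambda up: sum(up["transform"])),
})
# results["load"] == 60
```

Real orchestrators add scheduling, retries, and logging on top of this ordering idea, which is why the hosted trusts lean on Airflow and Dagster rather than hand-rolled runners.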
[00:22:27] Unknown:
So let me give you just one example of what makes the microservice architecture so attractive for the particular types of problems that we're trying to solve. One of our data trusts is specifically set up to address (actually, many of them are generally addressing) the problem of talent pipeline management, or education to work. In that scenario, it's often the case that individuals who are trying to decide between different opportunities for higher education, whether it's vocational ed, a two year college degree, or a four year college degree, have a lot of alternatives available to them, but no data that helps them make the decision about which might have the best return on investment for their particular scenario. And return on investment data is something that we've been deeply involved in the thinking around for quite some time. Greg mentioned the Workforce Data Initiative; that was one of the early efforts that led to the creation of BrightHive in the first place.
As well, there are certain regulatory reasons why ROI data is becoming more and more important and relevant to education and workforce organizations. The issue is, though, that if you actually want to calculate ROI, you need access to individual level wage records in most cases. You need to know how much graduates of a particular program are earning, and whether they're working in the field that they've been trained in. And that sort of data is, as you might imagine, extremely sensitive. Generally, it's housed at state agencies or federal agencies, and the rules surrounding the use of that data are really quite strict, as you would hope they would be. What our early architectural decision to support both the big cloud providers and hosting certain components of the platform in house and on premises allowed us to do was to work with this particular client to deploy the actual metric calculation, the ROI calculation engine, locally on premises, and nevertheless to be able to take the outputs of it, ingest them into the data trust, and make those available to other members. The calculation engine itself, we as BrightHive don't even have the ability to query directly.
We can't find out how much you earned last year, which, again, is good. We shouldn't be able to. But the rules surrounding the aggregation, for example the limit on the smallest cohort for which you're allowed to calculate means and medians and other aggregate statistics, are all implemented in code and kept isolated on the on premises instance. Then we're able to communicate with it securely, get what we need, and incorporate the outputs into the data trust without actually having to be in the position of managing this extremely sensitive dataset
[00:25:30] Unknown:
ourselves on a cloud provider. Yeah. There's the issue of data that is privacy protected and covered by certain regulatory regimes for the owner of the data, where somebody who is partnering with them in a trust either doesn't have the controls in place or doesn't have the authorization to access it. And so I'm interested in how you approach some of those types of scenarios. For instance, if somebody's data is covered by HIPAA, and there's another member of the trust who's providing information about employment, and they want to be able to perform some sort of aggregates across those two datasets in either direction, how do you handle the aspect of the person with the employment records not having the controls
[00:26:18] Unknown:
and agreements in place for being able to access some of those HIPAA protected elements, and how do you ensure they're still able to gain some value from the trust? Yeah. That's a really important problem, obviously, and it's one that we're actively working on right now. So I don't want to give the impression that we've solved the secure multiparty computation problem entirely and are ready to deploy it for anybody who comes to us with their checkbook open. Just full transparency here: this is a set of features that we're actively working on, and we think we have a conceptual solution to it, apart from the scenario where there is a trusted entity, a trustee, who does have the ability to access both datasets, in which case the problem is much more straightforward. To handle the situation where there are two datasets that need to be combined to produce some metric, but there's no organization with the authorization or the trust relationships to access both, you either need some of the speculative or actively under research secure multiparty computation technologies, or you need to create a jointly governed entity, like a data trust, which can execute a pipeline (or, as we call it internally, a DAG) and be able to take encrypted copies of both datasets, decrypt them using keys to which neither party has access, perform the calculation, destroy the data, and return the results.
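The flow described here (take encrypted inputs, decrypt inside the trust's isolated environment, compute, destroy the raw data, return only the result) can be sketched roughly as follows. Note that the "encryption" below is a base64 placeholder so the example runs with the standard library alone; a real implementation would use genuine public key cryptography, and all of the names here are hypothetical:

```python
import base64

# WARNING: base64 is NOT encryption. It stands in for real public key
# encryption purely so this sketch is self-contained and runnable.
def encrypt(data: bytes, key: str) -> bytes:
    return base64.b64encode(key.encode() + b"|" + data)

def decrypt(blob: bytes, key: str) -> bytes:
    prefix, _, data = base64.b64decode(blob).partition(b"|")
    if prefix != key.encode():
        raise ValueError("wrong key")
    return data

def run_trust_dag(encrypted_inputs, trust_key, metric):
    """Decrypt inside the trust's compute, calculate, destroy, return the result.

    Neither contributing party holds trust_key; in the scenario described it
    would live only inside a short lived, jointly governed compute environment.
    """
    raw = {name: decrypt(blob, trust_key) for name, blob in encrypted_inputs.items()}
    result = metric(raw)
    raw.clear()  # the decrypted inputs never leave this function
    return result

# Each party encrypts its dataset to the trust before handing it over.
inputs = {
    "wages": encrypt(b"52000,61000,48000", "trust-key"),
    "graduates": encrypt(b"3", "trust-key"),
}
avg_wage = run_trust_dag(
    inputs, "trust-key",
    lambda raw: sum(map(int, raw["wages"].decode().split(","))) / int(raw["graduates"]),
)
# avg_wage is the mean wage; neither raw dataset is exposed outside the DAG
```

The point of the sketch is the shape of the protocol, not the cryptography: inputs arrive encrypted, only the jointly governed compute can decrypt them, and only the agreed output leaves.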
And we think that we have a pretty good idea around how to do that. It involves, you know, basic public-key encryption of both datasets that are the input, a well-defined input specification for what goes into a particular DAG as well as what comes out of it, and then, of course, pretty strict logging and authorization controls around when that DAG is allowed to be run, by whom, and keeping track of each run. So it's basically like you could imagine spinning up a compute environment that we, as BrightHive, just can't log into. It's just a Lambda function or a short-lived EC2 instance performing the calculations on data that's decrypted, and then re-encrypting the output for each of the parties who's allowed to access it, and then distributing that through the trust. That sort of approach, where the computation is happening on infrastructure that's owned by the trust, if you will, gives everybody involved a measure of control, and it allows not just the access to the data to be logged, but the access for what purpose to be logged. Like, oh, the data was accessed, but this particular routine was run on it, and you can look at it, the code is available, and then the data is destroyed afterwards and only the outputs are exposed. I think that's the kind of scenario that you could imagine working in an environment where everybody trusts each other, sort of, but has really strict compliance requirements that they have to live within, and really strict relationships and consent agreements with the people whose data is actually being affected, whom they are worried about from a number of different obvious perspectives: public relations and privacy breaches and everything else.
We have to reassure everybody involved that they have a high degree of control over what's happening, that all the i's are dotted and all the t's are crossed in terms of compliance, and that everything is, obviously, encrypted and handled appropriately according to best practices. And we think a trust is a good environment in which to meet all those requirements.
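The pattern described above, encrypting both inputs so that only an ephemeral, trust-controlled compute environment can decrypt them, running a pre-approved DAG with strict audit logging, destroying the plaintext, and returning only encrypted outputs, can be sketched roughly like this. This is an illustrative toy, not BrightHive's implementation: the SHA-256 keystream cipher stands in for real public-key encryption, and all of the names and data are invented.

```python
import hashlib
import json
import secrets
from datetime import datetime, timezone

def keystream_xor(key: bytes, data: bytes) -> bytes:
    # Toy symmetric cipher (SHA-256 counter keystream XOR) standing in for
    # real public-key encryption -- do NOT use this in production.
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(x ^ y for x, y in zip(data, out))

audit_log = []  # every DAG run is recorded: what ran, by whom, when

def run_trusted_dag(dag_name, enc_a, key_a, enc_b, key_b, output_key, run_by):
    """Decrypt both inputs inside the trust's ephemeral compute, run one
    pre-approved aggregate, log the run, and return only encrypted output."""
    a = json.loads(keystream_xor(key_a, enc_a))  # HIPAA-covered cohort ids
    b = json.loads(keystream_xor(key_b, enc_b))  # employment records
    # The only computation allowed: an aggregate, never row-level export.
    result = {"patients_employed": sum(1 for pid in a if b.get(pid) == "employed")}
    audit_log.append({"dag": dag_name, "run_by": run_by,
                      "at": datetime.now(timezone.utc).isoformat()})
    a = b = None  # "destroy" the decrypted data before returning
    return keystream_xor(output_key, json.dumps(result).encode())

# Each party encrypts its data; only the trust's compute environment,
# which neither BrightHive nor the members can log into, sees both keys.
key_a, key_b, key_out = (secrets.token_bytes(32) for _ in range(3))
enc_a = keystream_xor(key_a, json.dumps(["p1", "p2", "p3"]).encode())
enc_b = keystream_xor(key_b, json.dumps({"p1": "employed", "p2": "unemployed"}).encode())

enc_result = run_trusted_dag("employment_by_cohort", enc_a, key_a,
                             enc_b, key_b, key_out, run_by="trustee")
print(json.loads(keystream_xor(key_out, enc_result)))  # {'patients_employed': 1}
```

The point of the sketch is the shape of the flow, not the cryptography: inputs arrive encrypted, only an approved routine touches plaintext, the run is logged, and only an aggregate leaves.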
[00:29:47] Unknown:
And then somewhat tangential to the idea of privacy, and some of the issues of regulation around it, is the idea of ethical uses of the data, which is something that is very subjective and hard to define, specifically using technical guards. And I know that you have some approaches in the legal frameworks that you put forth as to how to identify some of the guidelines for making sure that you're using ethical practices in these data analyses. Another element of that is issues pertaining to bias that exists both in the source data and in the algorithms that are used to process it. And I'm wondering what your involvement is with the members of the trust to help them identify ethical best practices and ensure that the ways they're using the data, processing it, and trying to counteract bias in the source sets are up to current industry best practices.
[00:30:47] Unknown:
And any issues or advice that you have along those lines. Yeah. This is so important, and it's really, again, in our DNA to care a lot about this; it's actually largely driven our decisions around how we grow and who we work with. While the idea of the data trust broadly, and the software we've written, is, I think, appropriate in a lot of different contexts, we've limited our work so far, and the clients that we've taken on, largely to the education-to-work domain, which is one that we understand very well. We have a lot of folks on staff with deep subject matter expertise who have the ability to look at the particular problems that the data trusts are trying to solve, identify potential issues of bias or some other ethical issue that might arise, and actually bring their own expertise to bear on it, or at least be able to issue spot and get questions that might arise up to the governance committee to deal with. Because we've circumscribed the realm in which we're working with our early clients, that has given us a lot of comfort: we're working with these early clients very closely, we know what they're doing and why, and we have the in-house expertise to spot potential issues. As we grow as a company and move beyond the education-to-work domain, this is a harder problem to solve unless you want to basically staff up in every single domain that may exist. If we were to move into health care, for example, we would have to go and hire folks who have been working in health care IT or in actual health care practice long enough to be able to issue spot with the same sort of rigor that we're able to in the domain we currently work in.
And I think the goal is that, as we move on as a company to the place where we're working across a bunch of different domains, we wouldn't necessarily have to, but we probably would continue to offer that as part of our services offering in a lot of different scenarios.
So it's a combination of making sure that we're staffed to handle the trickiest bits. Like, if we start working with health data, having somebody in house, at least one person, who is able to weigh in on those issues and spot them. But then, also, as we're building out the software, being able to use tools like AWS Macie and others to bring some machine learning to bear on this as well, to at least flag potential issues. Like, oh, it looks like this is a gender field. Are you accounting for the fact that gender can change for some individuals, or that this shouldn't be stored as a binary? Or, oh hey, it looks like this is a Social Security number. Are you sure you want to publish this? Things like that can be flagged in an automated way.
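Automated flagging of this kind can start with simple heuristics long before machine learning enters the picture. A minimal sketch, with rules and field names that are ours for illustration rather than anything Macie or BrightHive actually ships:

```python
import re

# Matches the classic dashed SSN format, e.g. 123-45-6789.
SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def flag_columns(columns):
    """columns: dict of column name -> list of sample string values.
    Returns human-readable warnings for a reviewer, not hard blocks."""
    warnings = []
    for name, samples in columns.items():
        if any(SSN_RE.match(str(v)) for v in samples):
            warnings.append(f"{name}: looks like SSNs -- are you sure you want to publish this?")
        if "gender" in name.lower() and set(map(str.upper, map(str, samples))) <= {"M", "F"}:
            warnings.append(f"{name}: stored as a binary -- consider a more inclusive encoding.")
    return warnings

flags = flag_columns({
    "participant_gender": ["M", "F", "F"],
    "tax_id": ["123-45-6789"],
    "zip": ["60601"],
})
for w in flags:
    print(w)
```

Each warning is surfaced to the governance committee rather than acted on automatically, which matches the point made next: flagging assists review, it does not replace it.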
I don't think it's sufficient to just rely on automated flagging, which is part of why a governance structure exists in a data trust. You would hope that the decision to publish a particular data resource, if it's being reviewed by multiple parties who are contributing data to it, would highlight a lot of these issues through that review process. But given that the data trust idea is new to a lot of folks and the governance structures that we're setting up are still new, we do feel like it's incumbent upon us as a vendor to keep our own human eyes on a lot of what's happening, so that while we're in the process of automating some of these ethical controls, we have highly trained individuals who are helping guide us along that path. Yeah. And I'll also say that we have
[00:34:12] Unknown:
security consultants external to our company whom we work with a lot. So oftentimes, especially as we're making certain technical and architectural decisions, we do consult with our security team to make sure that those decisions are in line with the best interest of protecting the data of the users we're entrusted with.
[00:34:37] Unknown:
And then another element of the concept of the data trust is that the organizations that are coming into the trust and sharing their data obviously need to have some sort of technical capacity for maintaining their end of the system: storing the source datasets, securing them by their own means, and updating them at whatever frequency they need to make sure the data is fresh and valuable to every member of the trust. And I'm wondering what you have seen as far as some of the common needs on their end for being able to participate in the trust, or any challenges that they have in fulfilling the technical aspects of membership and ensuring that they have sufficient uptime and availability
[00:35:26] Unknown:
and accuracy in their source data. Yeah. There's so much variance, especially if you imagine working with nonprofits, foundations, and government agencies. You can just about imagine the huge variety of structures that exist in legacy database systems and internal technical staffing models. And the biggest challenge is always when you have three organizations who've never shared data with one another before: they each have different cultures around the way they share it, the way they manage it, and the way they keep it updated, and the culture shock that happens when you throw the three of them into the same room together is something that, again, we rely on humans to actively manage at this point. I would say that, in general, our assumption going into this work, which informed our initial thinking around the architecture itself, was that API access to the various data resources, and then automated ETL processes that posted updates to those APIs, was something that would work for, you know, not every data trust member, but a large subset of them. And as we've gotten into it, I think we've come to appreciate that there's actually another class of actors for whom that's just a mismatch with the way that they work with data and the way that they both report it and use it internally. They'd much rather start from the place of generating a report, uploading an Excel spreadsheet, dealing with flat files, rather than RESTful APIs.
And so we are adding capabilities to our platform that support those kinds of users as well, because, as you mentioned, the technical capacity issue is one that prevents a lot of data collaboratives from forming or from being sustainable in the first place. And you kind of have to meet the various members where they are. You can't necessarily create a data management and data engineering culture from scratch where none exists, just for the sake of a single collaborative project. You really have to build tooling that is appropriate for the audience that needs to use it. Greg, do you have anything to add to that? No. I think you covered it really well.
[00:37:32] Unknown:
Absolutely. As you said, there is a lot of variance, and as an organization we've actually spent a lot of time with some of our earlier data trusts helping them to understand the data that they have, in terms of what its value is, and also helping them build processes to actually extract data into the data trust. So that's been a really big part of the close work that we've been doing with a lot of these agencies lately.
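Supporting flat-file uploads alongside RESTful APIs, as described above, can be as simple as normalizing both paths into a single schema before anything enters the trust. A hedged sketch; the function and field names are invented for illustration:

```python
import csv
import io
import json

def rows_from_json(payload: str):
    # Path 1: a member's automated ETL posts JSON to the trust's API.
    return json.loads(payload)

def rows_from_csv(uploaded_file):
    # Path 2: a member uploads a CSV export from a spreadsheet or report.
    return list(csv.DictReader(uploaded_file))

def ingest(source_kind: str, raw):
    if source_kind == "api_json":
        rows = rows_from_json(raw)
    elif source_kind == "csv_upload":
        rows = rows_from_csv(raw)
    else:
        raise ValueError(f"unsupported source: {source_kind}")
    # Normalize keys so downstream pipelines see one schema either way.
    return [{k.strip().lower(): v for k, v in row.items()} for row in rows]

api_rows = ingest("api_json", '[{"ID": "1", "Status": "enrolled"}]')
file_rows = ingest("csv_upload", io.StringIO("ID,Status\n2,completed\n"))
print(api_rows + file_rows)
```

Both members end up contributing identically shaped rows, so the less technically staffed organization isn't forced to stand up an API just to participate.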
[00:38:08] Unknown:
Yeah. We've had customers come to us and say that the process of creating a data trust and working with BrightHive has actually caused them to take a step back and think more broadly about the culture of data sharing and data management across their various agencies, and actually try to make a deeper change in the way that they're operating. Even if it's not necessary just for the sake of the data trust, they feel like they've been able to appreciate all the different approaches that exist out there and, with that broader view, revisit the way that they're handling data internally. And that's really cool. In order to scale this company, we won't be able to work with every single customer at the level that we are with our early ones, but it's really rewarding to see those early customers make those connections and build on what we're doing in a way that spans all the work they do, whether it's collaborative or not.
[00:39:09] Unknown:
So true. You know, I'll say that one time, looking back at a particular customer of ours, we saw them go from a place where it was really difficult to get all the data sharing agreements in place, to get the players at the table to actually share data with us, and then saw that paradigm suddenly shift from "yeah, it's going to be difficult to get you this data" to "hey, can you add this new data that we have into the data trust?" That was huge for me. I was like, wow, they understood the value of the data trust and what it does for them.
[00:39:50] Unknown:
Another thing that I'm curious about is the life cycle of these trusts that you're working with, and whether they are intended to be short lived and only exist for the duration of a particular project, or whether they are generally set up as more of a long-term engagement with no set termination point. And as a corollary to that, I'm curious how you're approaching the sustainability of BrightHive itself, to ensure that for any of these data trusts that are designed to last in perpetuity, you're there for the lifetime of the project as a support mechanism and can make sure the data trust continues to be viable. To take the first question
[00:40:33] Unknown:
first, I think setting up the full governance structure and getting to the point where everybody can sign off on the data trust agreement is time consuming enough, and creates enough new entities, if you will, and new relationships, that it would probably be overkill for just a quick one-off pilot. If you were really just doing something that was strictly time-boxed, where you wanted to try a new approach to a particular problem or generate a particular dataset for a researcher to look at, you'd probably just sign a one-off data sharing agreement and do the thing, and that's that. The idea behind a data trust is that it is extensible and governable over the long run, and so sustainability is, I think, pretty fundamental to the model. The idea is that in a lot of these cases, agencies who are affecting the same population, or philanthropies who are serving populations that are also receiving services from other related groups, have a long-term need to be able to work together.
And providing the venue for them to do so over the long run is kind of what we're here for, as opposed to something that's more limited in time and scope. But that does raise the issue of vendor lock-in and the sustainability of BrightHive as a young company, and that has driven a lot of our strategy around open sourcing. For us, it's a very tactical decision. I'm not an open source fundamentalist, if you will, but I am a believer that it really does have very important uses for businesses, especially young businesses, because it provides comfort to the clients: if we were to disappear, or if a new competitor were to come along offering something better and more appropriate, well, they own the data. They have access to the core of the code, which would allow them to extract it or even keep the services running, and therefore it reduces the fear on their part that all of a sudden BrightHive disappears and their software is no longer supported. There is an alternative model where they either internally, or by working with one of our partner organizations or some other vendor, make the transition,
[00:42:45] Unknown:
with all the documented source code available to them, and that seems to have helped a lot. And in terms of the types of trusts that you've worked with and some of their outcomes, I'm curious what you have seen as the most interesting or innovative or inspirational ways that the BrightHive platform has been used, as well as this broader concept of a data trust being leveraged.
[00:43:07] Unknown:
I can give one really good example from one of our data trusts. So the data trust that I'm going to talk about does a lot with workforce development: a lot of services that they offer to individuals who might be displaced employees, who may be veterans, or who might just be looking for ways to improve themselves. So you have this group of agencies that are doing a lot of really good and valuable work, but they don't actually have much insight into just how effective that work is. One of the eye-opening moments for me came after we'd gotten the data trust set up and data started to flow. We had our go-live, they started to get data into the data trust, analysis was being done, and data was being used in third-party dashboards and applications.
And I'm sitting in a meeting, and one of the individuals who is in charge looks at us and says, "Wow, for the first time, we're actually able to see trends in things." Just, you know, things that might not be very interesting to the wider population, but here we're seeing this trend where individuals are coming into our service centers between these two hours, or we're seeing more applications for services at two in the morning, for instance. So just the mere fact that they're suddenly unlocking the power of their data to gain insight they wouldn't otherwise have been able to gain, it made all the difference in the world for me, and it really was the thing that cemented the reason why I do what I do at BrightHive. Tom, do you have anything to add? I think one of the most fun things about working here is
[00:45:06] Unknown:
actually going to these governance meetings and seeing the folks who are contributing data to the trust and consuming data from it have these really generative moments of, "Oh, hey, we have these data in the trust. We could do these thousand other things," and to see the wheels start turning. Right? We've been doing this now with most of our customers for months, not years. But to see those light bulbs come on, and to hear people talk about the ways that they want to be able to use this trust, and the data that they're putting into it, three years from now to do something totally transformative.
The fact that people are approaching data collaboratives with that sort of spirit of generating new ideas, and generating innovation in the way that services are delivered, is striking; it's a weird venue for it in some ways. Right? Usually, data should be supporting that kind of innovation, but you wouldn't necessarily think it would be driving it. And yet being able to actually look at outcomes, being able to look at the way that services are being delivered to individuals from a number of different directions, seems to open people's eyes and make them step back and think about the bigger picture in a way that's really conducive to making new connections and coming up with new ideas.
I have some examples that I could refer to that have happened already, but what I'm really excited about is this: as these trust relationships between the members of our data trusts solidify, and as these new ideas keep coming up, I think we'll actually see some pretty meaningful transformations in the way that services are delivered to individuals who are just entering the job market for the first time. Agencies who have just never worked together seriously in the past will start to collaborate at an individual level, track their referrals, whether they're working or not, and understand the various barriers to employment in a truly robust sense, where it's not just "oh, this person needs a college degree," but maybe also "this person doesn't have a car and needs a way to get to work," or "this person is struggling with a mental health problem." To take that sort of holistic approach to case management, and to see an individual whom a case manager may have been working with for two years but never really understood the core of what was keeping them from taking the steps they wanted to take.
That's really cool. It's, I think, potentially transformative. So I'm looking forward to seeing that materialize over the coming months and years.
[00:47:43] Unknown:
And as we look forward into the future of BrightHive, and the future of how data cooperatives and data trusts are being used as data continues to be central to almost everything we do, I'm wondering what you have planned and what trends you foresee as we move forward. I think one of the big trends that we're going to see is one that we're already preparing for, which is the expansion of these data collaboratives beyond
[00:48:10] Unknown:
particular subject matters, let's say, or silos. Education-to-work obviously already includes a number of different stakeholders: it includes public and private education providers, software boot camps, employers, K-12. It's already a pretty large community, but as I mentioned before, we understand that barriers to employment go way, way beyond that. And there's so much curiosity and openness now to this notion of whole-person care, if you will, which has emerged out of research and discussions over the past decade or so. Seeing that come to fruition, seeing agencies and philanthropies who are working on seemingly disparate subject matters make connections between them, and being able to make better decisions and steer people in better directions based on something that's happening in another silo, or another part of the state government altogether, that they just never thought about much before, is, I think, something that's going to happen. It's coming from the top down as well as the bottom up: folks who are working within these agencies are getting curious about it and pushing their bosses to take that next step, and political appointees and agency heads are coming in and saying, hey, we need an actual data strategy.
And I think those two forces pushing at the problem from either end are leading to the expansion of people's idea of what's possible in terms of collaborating with data. That's thing one. And I would say thing two is this notion of sustainable data governance. To me, this is the biggest problem that is yet to really be solved with technology in the realm of data engineering. Within a single enterprise, you can build, you know, a data bus, a set of tools for collaborating around data. But once you go beyond a single chain of command and involve a bunch of different stakeholders who are coming at it on basically a level playing field, where everybody needs to agree and sign off on a decision before it's made, I don't think there's great technological support for those types of organizations right now, or for the ongoing support, governance, and maintenance of those relationships once they're established for a given purpose.
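The governance constraint described here, decisions on a level playing field where every stakeholder must sign off before anything proceeds, is easy to state even if the tooling around it is immature. A toy model of such a unanimous-consent gate; the member names and the API are purely illustrative, not an existing product:

```python
class GovernanceGate:
    """Unanimity, not majority: a proposed change to a shared data
    resource proceeds only once every voting member has approved it."""

    def __init__(self, members):
        self.members = set(members)
        self.approvals = set()

    def approve(self, member):
        if member not in self.members:
            raise ValueError(f"{member} is not a voting member")
        self.approvals.add(member)

    def is_approved(self):
        return self.approvals == self.members

gate = GovernanceGate(["state_agency", "community_college", "nonprofit"])
gate.approve("state_agency")
gate.approve("community_college")
print(gate.is_approved())  # False: the nonprofit has not signed off yet
gate.approve("nonprofit")
print(gate.is_approved())  # True: now the change may proceed
```

A real system would add proposal payloads, expirations, and an audit trail, but the core difference from single-enterprise tooling is exactly this: no single decision maker can force the gate open.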
So I see, first of all, the demand for this type of arrangement continuing to grow. And then, second of all, the technical
[00:50:42] Unknown:
space for software tooling to solve that real problem that's preventing those relationships from being sustainable and really productive. Are there any other aspects of the work that you're doing at BrightHive, or the concept of data trusts and data collaboratives, or any of the engineering efforts that you're involved with, that we didn't discuss yet that you'd like to cover before we close out the show? I think
[00:51:06] Unknown:
no, I think we did. We pretty much covered the bulk of what's on the BrightHive horizon, so I feel pretty good. I will say, though, that one thing that was sort of an eye opener for me over maybe the last six months is just how much more aware people are becoming of privacy, and especially the privacy of their own data. And it's not just the sort of data elite who are thinking in those terms. So as society in general starts to think more carefully about the data that it's actually sharing, I see this concept of a data trust becoming even more widespread than it currently is.
[00:51:53] Unknown:
Yeah. Absolutely. And driven in part by regulations too, right? The CCPA, the GDPR, all those four-letter acronyms. I don't think that's the end of the story around data privacy regulation; I think it's just the beginning. So not only are consumers and citizens starting to ask questions about how their data is being used, but their representatives are starting to set some pretty clear boundaries around it. It's not entirely clear to me that a company like Palantir is set up in a way that's compatible with some of those controls, whereas I think a data trust really can be. So I think the regulatory environment is another thing to keep an eye on, as well as citizen expectations.
[00:52:35] Unknown:
Alright. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today, and I'll start with you, Tom. Well, like I said, I think the biggest
[00:52:55] Unknown:
gap is in the glue. So many of the technical aspects of sharing data actually have a really robust ecosystem around them. The piece that's missing is being able to bring all of it together and make it sustainable and manageable by the data custodians themselves, as opposed to relying on everybody collectively going and signing up with a single vendor and having a uniform IT environment across the entire ecosystem. That's largely impractical, especially in something like the social sector where there's a whole bunch of different actors.
So to me, the glue that connects data infrastructure provided by multiple vendors, and the governance structure that makes it sustainable, is the biggest missing piece right now. If you're a single enterprise with a single decision maker at the top who can say "use this vendor's software," I think you can solve a lot of your data management problems that way. Or if you have a large enough IT staff, you can take the open source tools that are out there, glue them together, and solve the problem that way. But once you start introducing multiple stakeholders into the equation, I think the technical model and the governance model are not standardized yet; they haven't settled on anything.
[00:54:26] Unknown:
I tend to agree with Tom as well. Honestly, in terms of our technology, I feel we definitely understand a lot about building and sharing data just, you know, out of practice. Not just BrightHive, but as an industry, we've been doing data management for a long time. But, again, I think that automating the legal and governance parts of data trust management is really something that's going to be very important moving forward. As we look at things like blockchain and what it tries to bring to the fore, how do we take some of this knowledge of, say, things like smart contracts, and apply it to things like data sharing agreements within a data trust? So I'm really excited not only to see how things like smart contracts will help to shape our thinking, but also to be leading part of the work where we're actually thinking about this stuff as it applies not only
[00:55:32] Unknown:
to BrightHive, but also to data sharing and data trusts in general. Alright. Well, thank you both very much for taking the time today to join me and share the work that you're doing at BrightHive. It's definitely a very interesting area of the overall work being done with data management, and I'm excited to see some of the future work that you do and some of the outcomes of the data trusts that you're helping to build. So thank you both for all of your efforts, and I hope you enjoy the rest of your day. Thank you, Tobias. You too. Yep. Thanks for having us. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to BrightHive and Guests
Understanding Data Trusts
BrightHive's Mission and Structure
Data Ownership and Intellectual Property
Technical Architecture of BrightHive
Data Privacy and Secure Computation
Ethical Data Use and Bias
Technical Capacity and Challenges for Trust Members
Lifecycle and Sustainability of Data Trusts
Inspirational Use Cases and Outcomes
Future Trends in Data Trusts and BrightHive's Vision
Closing Remarks and Final Thoughts