Summary
As data science becomes more widespread and has a bigger impact on people's lives, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the effort required to prevent negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics within and between data teams. It is a Python project that generates a checklist of common concerns for data-oriented projects, organized by the stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- This is your host Tobias Macey and this week I am sharing an episode from my other show, Podcast.__init__, about a project from Driven Data called Deon. It is a simple tool that generates a checklist of ethical considerations for the various stages of the lifecycle of data-oriented projects. This is an important topic for all of the teams involved in the management and creation of projects that leverage data. So give it a listen, and if you like what you hear, be sure to check out the other episodes at pythonpodcast.com
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Deon is and your motivation for creating it?
- Why a checklist, specifically? What’s the advantage of this over an oath, for example?
- What is unique to data science in terms of the ethical concerns, as compared to traditional software engineering?
- What is the typical workflow for a team that is using Deon in their projects?
- Deon ships with a default checklist but allows for customization. What are some common addendums that you have seen?
- Have you received pushback on any of the default items?
- How does Deon simplify communication around ethics across team boundaries?
- What are some of the most often overlooked items?
- What are some of the most difficult ethical concerns to comply with for a typical data science project?
- How has Deon helped you at Driven Data?
- What are the customer facing impacts of embedding a discussion of ethics in the product development process?
- Some of the items on the default checklist coincide with regulatory requirements. Are there any cases where regulation is in conflict with an ethical concern that you would like to see practiced?
- What are your hopes for the future of the Deon project?
Keep In Touch
- Emily
- Peter
- Driven Data
- @drivendataorg on Twitter
- drivendataorg on GitHub
- Website
Picks
- Tobias
- Emily
- Peter
- The Model Bakery in Saint Helena and Napa, California
Links
- Deon
- Driven Data
- International Development
- Brookings Institution
- Stata
- Econometrics
- Metis Bootcamp
- Pandas
- C#
- .NET
- Podcast.__init__ Episode On Software Ethics
- Jupyter Notebook
- Word2Vec
- cookiecutter data science
- Logistic Regression
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. This is your host, Tobias Macey. And this week, I'm sharing an episode from my other show, Podcast.__init__, about a project from Driven Data called Deon. It is a simple tool that generates a checklist of ethical considerations for the various stages of the life cycle for data oriented projects.
This is an important topic for all of the teams involved in the management and creation of projects that leverage data, so give it a listen. And if you like what you hear, be sure to check out the other episodes at pythonpodcast.com. Your host as usual is Tobias Macey, and today, I'm interviewing Emily Miller and Peter Bull about Deon, an ethics checklist for data scientists. So, Emily, could you start by introducing yourself? Sure. I'm Emily Miller, and I'm a data scientist at DrivenData.
[00:01:24] Unknown:
My background's in international development. And so before becoming a data scientist, I worked in policy research at the Brookings Institution and at Stanford University. And I'm really excited to be in the data for good space because I feel like there's so much potential for using data science to help tackle really big societal problems. And Peter, how about yourself? And hi. I'm Peter Bull. I'm 1 of the cofounders of Driven Data.
[00:01:49] Unknown:
I also lead a lot of our technical work. My background originally is in software engineering. I actually was a philosophy undergrad major, then went and worked at Microsoft for a number of years. And after that, got sort of more interested in data questions and went back to grad school to do a computational science degree. And while I was in grad school, I was looking to apply this toolkit to social impact problems. I didn't find many good outlets for doing that. And so a couple of my friends from grad school and I started Driven Data. And Emily, going back to you, do you remember how you first got introduced to Python? Yeah. So as I mentioned, I used to work as a research assistant, and I was doing econometric regressions
[00:02:35] Unknown:
mostly in Stata, which is the proprietary software preferred by economics professors. And I, over time, decided I didn't wanna stay in research. I didn't wanna get a PhD, but I really loved working with data, and that became kind of the impetus for transitioning into data science. And so based on my Internet research, Python was clearly the language of choice, the language to learn. And so I started teaching myself Python, and then eventually went to a data science boot camp called Metis. And I just remember very distinctly first working with Pandas and realizing how easy and straightforward it was to do data manipulation and just how much easier it was to work with data than in all of the proprietary software.
[00:03:18] Unknown:
So it was a pleasant surprise to get to work in Python and have that be part of my daily life now. And R is 1 of the other big languages, at least in terms of open source for data science. So have you crossed paths with that at all since you started working in the field? I worked with R a little
[00:03:34] Unknown:
bit back in undergrad, actually, for my undergrad thesis, mostly because my thesis adviser recommended working with R. Since working in Python, I haven't really had a need to then switch over to R, so I've just been
[00:03:45] Unknown:
pretty much exclusively working with Python. And, Peter, do you remember how you first got introduced to Python? Yeah.
[00:03:51] Unknown:
So I was working at Microsoft and using the .NET stack, mostly working in C# and C++. I worked on 2 teams there: first, the Microsoft Office user experience team, and then Microsoft SharePoint. And I was actually looking for a language or a technology to learn to do side projects in. I had a couple ideas for sort of small little web apps that I wanted to try building, and I wanted an easy path for building and then deploying those applications on the side. And it just seemed like Python was a really great fit for that kind of work. So I started building little side projects in Python, and that's really how I got introduced to it. And since then, it has just really grown on me as a language that we can do both our data work in and a lot of the software engineering work that we do.
[00:04:50] Unknown:
And so you both work now at Driven Data. And as part of the work that you've been doing there, you built this tool called Deon. So can you give a bit of a description about what it is and your motivation for creating it? Yes. Deon
[00:05:04] Unknown:
is a command line tool that generates ethics checklists for data science teams. So what it does is give you 1 simple command that you run as a data scientist, integrated into your workflow, that will add a set of items that really are questions to think about when doing data science work. And we've mapped ethical concerns to different steps in the data science process. And the idea is that as a data scientist, it's hard to pull myself out of my day to day work to ask these questions that I'd like to be asking as part of my process. And Deon is designed to make it easy to say, okay, I'm working in the Python stack. I'm working at the command line.
I wanna make sure we have these ethics discussions. How do I get that started in my project? And so by running a command at the command line, we can generate a markdown file that has these ethical checklist items. And then I can do a pull request to GitHub and say, okay, here's our ethics checklist, team. Let's make sure that as we go through the course of our project, we're discussing these items and we're all on the same page about the decisions that we make in how we think about our data.
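For reference, the workflow Peter describes boils down to a couple of commands. A minimal sketch (flag spellings follow the deon documentation around the time of this episode; check deon.drivendata.org for current usage):

```
$ pip install deon
$ deon --output ETHICS.md        # write the default checklist as markdown
$ git add ETHICS.md
$ git commit -m "Add data ethics checklist"
# ...then open a pull request so the whole team sees and discusses the items
```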
[00:06:19] Unknown:
And in terms of the checklist, after it's generated, are there any other utilities built into that command line tool to link code comments or code attributes back to different items within the checklist, or is it mostly there to form a focal point for ongoing discussion of those concerns within the project? So we've structured it to have
[00:06:49] Unknown:
reference numbers for each of the sections so that you could easily say, oh, I'm changing this part of how we store the data here in response to checklist item whatever. So it's relatively easy to refer to each of the items by either a short name or by a numbering scheme. There's no technical tie-in between the work that you would do and the checklist items. It's really designed to be relatively lightweight in terms of how much process it forces onto its users. But the idea is that by having this checklist as part of the technical workflow, the technical team is encouraged to have these ethical discussions.
[00:07:35] Unknown:
And what are the benefits of having it be formatted as a checklist and be loosely enforced based on the norms and policies of the organization, as opposed to being something that's more deeply integrated into the code base and the project life cycle, or something more general and policy oriented, along the lines of the Hippocratic Oath or something that the different members of the team are sworn to as part of their employment?
[00:08:09] Unknown:
So I think I'll sort of take that from 2 angles. The first question is why is this different from an oath? And the idea here is that the data science community was having a relatively robust discussion around ethics. So particularly after the Equifax breach and then the Cambridge Analytica scandal, lots of people were talking about how do we make sure that these technologies are used in ways that are ethical. And that sort of led to suggestions that, okay, we need an oath for data scientists, where they can say they're committed to only doing things that are good or right. And our perspective on that is that that is a good first step. But even if you have those intentions, it can be relatively hard in the moment to do the right thing because it's not necessarily always your top of mind concern. You've got deadlines. You've got PRs to review. You have project managers that are asking you questions.
And so we wanted something that was really actionable and something that could be integrated into the process. So it's not just a 1 time commitment to doing the right thing, but it lets us feel like we have a backstop that makes sure we have these discussions even in the face of other pressures that we have during the course of a project.
[00:09:39] Unknown:
Yeah. 1 of the things that checklists are really great at doing is translating principle into practice. An oath, as Peter was mentioning, can be a 1 time thing, and it can be general and vague, such as "do no harm." The thing that we really like about checklists is that they can be more specific, and because they're used repeatedly, they can be more targeted toward execution and action. They can really help us put these broader aspirations into context in a way that we can integrate into our workflows on a daily basis.
[00:10:18] Unknown:
That's 1 angle on why the checklist is different from an oath, but I also wanted to distinguish it from a commitment to ethical action that a company may have, where they have some processes or they have a corporate social responsibility group, or they have a team thinking about how their company can have an ethical impact, because there are lots of organizations that do that and end up doing relatively good work in the world. But the reason we wanted something like this checklist is that for a number of these data questions, data scientists actually have a particular perspective that's valuable. Often the question being asked is not what's right, but rather: are we doing what we agree is right with this data, in a way that we can agree on? And often, it's the case that with data science work, those are relatively tough questions to answer. And you need to do a lot of thinking and discussion to make sure you're even living up to your own goals.
[00:11:27] Unknown:
And I think it's worth noting that a lot of the items on the checklist are intended to provoke discussion. You'll notice that a lot of the items start with phrases like "have we sought to address" and "have we considered," because we're working at a level of abstraction here where we can't recommend, you know, remove variable x from y model. And so the idea is that ethics is hard, and it's filled with nuance and trade offs, and so there's not necessarily a clear right answer. And so the goal here is really to spur those discussions and think about what are the trade offs, and what are our intentions here, and can we be more explicit in why we've made the decisions that we've made, taking into account those ethical considerations.
[00:12:07] Unknown:
And the checklist itself is also fairly granular and broken out into different subcategories that encapsulate the various life cycle stages of a data project. Whereas with an oath that's more general, it can be difficult to directly tie a specific action to whether it is in line with your goals. And you're not necessarily going to go straight from everything being completely in line with what you're trying to achieve to then being at odds with it. It's more likely to progress in various small steps and gradations, so that you don't necessarily recognize in the moment that you have drifted from your intended goals and your intended mores for how the project is intended to be used. And so after a while, you then come to the realization that through those small steps, you inadvertently have gotten to a place that you're not comfortable with, and then it's difficult to back out of it because of all of the
[00:13:05] Unknown:
weight of legacy that has been built into the system and the models that have been generated from it. Yeah. And 1 of the reasons why we break the checklist up into the 5 different sections, around collection, storage, analysis, modeling, and deployment, is that there are particular ethical concerns that come up at those different points. So something like informed consent is really salient in the data collection process, versus looking at bias that's in the data set, which comes up during the analysis. And so having that separation, I think, allows us to better tackle the full range of concerns rather than trying to do everything up front, because it's not possible. All of those issues aren't necessarily coming up ex ante. They're coming up as you go through the data, as you're looking at things, as you're getting to deployment. You have different concerns than you had when you were first collecting the data.
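For context, the checklist that deon generates is organized into exactly those five sections, each with lettered, numbered items that can be checked off and referenced in discussion. Abbreviated, it looks roughly like the following; the item titles come from the default checklist as documented at deon.drivendata.org at the time, with the item text elided here:

```
# Data Science Ethics Checklist

## A. Data Collection
- [ ] A.1 Informed consent: ...

## B. Data Storage
- [ ] B.1 Data security: ...

## C. Analysis
- [ ] C.2 Dataset bias: ...

## D. Modeling
- [ ] D.1 Proxy discrimination: ...

## E. Deployment
- [ ] E.1 Redress: ...
```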
[00:13:59] Unknown:
And as is noted in the documentation for the Deon project, computer science, software engineering, and data science are among the few industries that don't have an established professional ethics board or anything along those lines. It's more left up to the individual contributors and individuals within an organization to try and adhere to what they view as being ethical. And I've had a discussion previously about some of the concerns of ethics in the software engineering context. But what is unique to data science and data projects in terms of those ethical concerns that differentiates them and requires a specific approach for those types of projects, as opposed to just traditional software engineering?
[00:14:43] Unknown:
Yeah. I think that's a really good question. And I started to think about this having had a background as a software engineer and really not feeling like the same sort of questions came up for me in the course of my work as a software engineer. So in my mind, there are 2 things that make these concerns particularly relevant to data scientists. If you look at the checklist, there are a number of concerns, particularly the ones about data storage and deployment, that are very relevant to software engineers as well. But the 2 differences are that data just is a reflection of the real world in a way that not every software product is. So when you're building software, if you're starting from scratch, you're often at a place where you get to make all of the design decisions.
You get to lay out what your perfect world looks like. Whereas when you're working with data, you've got this reflection of the real world as it exists already, and that's gonna come with biases built into that data that you need to make an effort to understand if you're going to use it effectively. So that's 1 way in which data science ends up being different than software engineering. And the other way is that machine learning is not designed to ask why questions. So machine learning is meant to extract patterns from the data that let us predict thing x or thing y.
And because we're not saying that the machine has to tell us why it's making these decisions, there's inherently more risk in that process. Whereas with software engineering, we're explicitly telling the machine what to do rather than asking it to learn what to do. And when we ask the machine to learn what to do, it may be doing it for reasons we don't agree with, and we need to be able to inspect that. We need to be able to audit that and ask questions about, is that actually how we want to reflect this decision making process?
[00:16:53] Unknown:
And for people who are deciding to use Deon within their projects, what is the typical workflow for introducing it and ensuring that it continues to be part of the development cycle as the project progresses?
[00:17:11] Unknown:
So, really, we think of this as a tool for data scientists, to get data scientists involved in this discussion. And so the idea is that we want it to be very easy for technical teams to use this. This isn't a product that's designed for project managers or for legal departments. But, really, we think data scientists want to be part of the discussion around ethics and have a particular perspective. And so the idea of the workflow is that, if I'm working on a team, I can actually run a command at the command line, and that will generate a markdown file that has all these checklist items. And I can immediately submit a PR with that and say, hey, team. I'm adding our ethics checklist to the project.
Let's make sure that during the course of the project, we look at these items at the relevant stages. And that's 1 way to start integrating it into your process. But even if you're doing a lighter weight sort of analysis where you're just working in a Jupyter Notebook, you can run a command and the checklist will get appended to the end of your Jupyter Notebook. And you can delete the items that might not be relevant for this particular analysis and just say, hey, I thought about things x, y, and z that are important for the work that we did here, and here's my note that I did think about these. And we can continue to have this discussion if we need to.
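The notebook case Peter mentions uses the same command; per the deon docs at the time, pointing the output at an existing .ipynb file appends the checklist as cells at the end (a sketch):

```
$ deon --output analysis.ipynb   # appends the checklist cells to the notebook
```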
[00:18:35] Unknown:
Yeah. I think it's worth noting that NA is a totally acceptable answer. There are projects that may not involve data collection. And so it is totally okay to not have a section be relevant or not have an answer to some of the items on the checklist. Not every item is gonna be relevant to every project.
[00:18:55] Unknown:
And so there is a default checklist that comes with the project. And given that it's just a markdown document that gets generated, there's room for editing or changing the checklist or adding various additional concerns. So have you seen any cases where people who are using Deon have added or modified those checklist items, or any items that you either have received or anticipate pushback on from people who are getting started with Deon?
[00:19:25] Unknown:
Yeah. So we don't have a good way of tracking if people have changed checklist items. So, I don't have any specific examples of that yet, but we just launched the tool about a month ago. So, hopefully, we'll get reports back of usage and the kinds of things that people are adding. But you can see a lot of domain specific changes that people may wanna make. So an easy example is if you work in health care, you may wanna have an item on your checklist that ensures that your data is stored on HIPAA compliant servers. Or if you work in an academic context, you may wanna have an item that documents how your study went through the institutional review board and got approved.
So there are domain specific changes that teams will wanna make depending on their context. And it's relatively easy to do that within the tool and come up with your own checklist file that you can then use in all of your projects in your context for generating this ethics checklist.
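To make that concrete, a custom checklist is a YAML file with the same structure as the default one. A hypothetical health care addition might look like this; the field names follow the custom checklist schema in the deon documentation as we understand it, and the content is invented for illustration:

```yaml
# custom_checklist.yml -- hypothetical domain-specific checklist
# usage (flags per the deon docs): deon --checklist custom_checklist.yml --output ETHICS.md
title: Health Data Ethics Checklist
sections:
  - title: Data Storage
    section_id: B
    lines:
      - line_id: B.1
        line_summary: HIPAA compliance
        line: >-
          Is all patient data stored and processed only on
          HIPAA-compliant servers?
```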
[00:20:31] Unknown:
And given that some of the elements in the checklist cover items such as data collection and data storage and the deployment of the resultant models, there's the potential for some of these concerns to span multiple teams within an organization. How does Deon simplify the communication around these ethical items across those team boundaries and across the broader organization?
[00:21:05] Unknown:
So in my mind, a part of this is about just having a common language to talk about ethical concerns. Because we've framed each 1 of the items with a particular description and a way to refer to it, and also given examples of where something has gone wrong with that particular checklist item, it makes it easier to have this common language where you can say, oh, we don't want to fall into a scenario where this kind of thing happens to us. How do we avoid that? So it's partly about just giving a team a language and a context for discussing these concerns. And then beyond that, I think that having a formalized process like this, even if it's a relatively informal one, gives you a reason to bring up topics that can be hard to bring up otherwise. A lot of these discussions are going to be hard discussions to have as a team. There's real work to do in a lot of these checklist items.
But without a reason to bring up those concerns, you may not do it because you know it's gonna be a tough discussion. And so having a part of your process where those discussions are built in, where you know you're expecting them, makes it that much easier to make sure that it happens. And it's not on any 1 person's shoulders to say, okay, we gotta have the ethics discussion now, and no one wants to have this discussion with me, but I have to make sure it's happened. It's my responsibility. And it sort of helps to spread that concern out throughout the team and make it everyone's responsibility and make everyone aware of what that discussion is going to look like.
[00:22:56] Unknown:
And in your experience working on data science projects or interacting with other people who are in a similar space, what are some of the items on the checklist that you have seen to be most commonly overlooked or possibly
[00:23:12] Unknown:
ignored? I think 1 of the most important items on the checklist for me is around dataset bias. And I think it's important to remember that no matter how much effort went into eliminating bias in the data collection or in the preprocessing, there will always be bias. We're not going to have a dataset that is perfectly unbiased. And so the question to ask is not "is my dataset biased," but "how is my dataset biased?" Because examining the possible sources of bias is the first step for figuring out what we can do about them. And what do we mean when we say bias? We're talking about things like stereotype perpetuation, confirmation bias, confounding variables that may be omitted, or even having imbalanced classes.
So an example of this comes from Word2Vec, and a lot of data scientists will be familiar with this. It's a set of word embeddings that captures the relationships between words, and the corpus that Google trained the neural network on was essentially all of Google News articles. And so the thing that on the surface people may not realize is that, despite this model being trained on a huge dataset, you're looking at this vector space and it's math and you're like, doesn't that mean that it's right? Doesn't that mean it's objective? It's been trained on this huge dataset. And we have to remember that the articles that the model was trained on exist in the context of society. Right? And so they're written by authors who have their own biases. And so these biases get embedded in those word embeddings with Word2Vec, for example. When you start to explore some of these representations, as researchers have done, you come up with things like "man is to computer programmer as woman is to homemaker." Right? And so we have clear evidence of these stereotype perpetuations, things like sexism, showing up in a dataset that on the surface may appear to be objective.
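The analogy Emily quotes is easy to reproduce. A minimal sketch using gensim's downloader API (the library choice is ours; the episode doesn't name one, and any pretrained Word2Vec vectors would do), probing the pretrained Google News vectors with the same vector arithmetic that the "man is to computer programmer" research used:

```python
# Probe pretrained Word2Vec embeddings for learned analogies.
import gensim.downloader as api

# Google News vectors, 300 dimensions (a multi-gigabyte download).
wv = api.load("word2vec-google-news-300")

# "man is to computer_programmer as woman is to ?"
# vector arithmetic: computer_programmer - man + woman
print(wv.most_similar(positive=["computer_programmer", "woman"],
                      negative=["man"], topn=3))
```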
The other thing that I'll mention here is around data collection, and I feel like there's potentially more awareness around bias in data collection. This shows up a lot in statistics, in thinking about how to have a representative sample when you're trying to estimate a trait about a larger population from the sample. But I think that identifying bias in data collection can still be really tricky, especially as we start to work more with passively collected data, where we're not going out and surveying people but are able to use data from smartphones, for example.
So to make this more concrete, there is an app called Street Bump, and it tried to passively collect data from smartphones, based on the accelerometer and the GPS data, to detect potholes. And then it would transmit this data to the city so that they could make more data driven decisions about how to direct public resources and help improve the roads. And so on the surface, this seems like a great idea. Right? We're using passively collected data, so we're not going out and bothering people and asking questions, and we're allowing a more data driven approach to using public resources. But we have to remember that smartphones aren't evenly distributed across the population.
Right? We know that lower income populations and elderly populations are less likely to have cell phones, or less likely to have smartphones. And so the effect of an app like this can actually be a diversion of public resources to areas that are younger and richer, which may not actually have the equitable effect that was intended.
[00:26:34] Unknown:
And some of those items are fairly difficult, particularly around data collection, in trying to ensure that you either account for or eliminate bias in that collection, and also in terms of identifying the biases within the data set. What are some of the other concerns within the checklist that are particularly difficult to comply with for a given data
[00:26:57] Unknown:
project? Yeah. I think 1 of the more difficult ethical concerns to address is proxy discrimination. And so you may not have variables in the model specifically on gender or race or age or religion, but you may have other variables that are highly correlated with those and therefore can act as proxies or be otherwise unfairly discriminatory. To give you an example here, there was a recent article, just from a few days ago, on how Amazon was trying to build a recruiting engine where you could feed in, say, a 100 resumes and it would spit out the top 5 that you should go interview. And it turned out the tool was biased against women. And it's not that there was an explicit gender field of male/female passed in to the model, but the model had picked up on things that only women did, like being the captain of the women's chess club or going to an all women's college, or even using certain terms that are more associated with women than men. And so you can see how on the surface, you say, okay, well, I don't have this variable, so that bias must not exist, but it finds many other ways to sneak in. I'll give you a second example here around child welfare screening.
1 of the reasons, I should say, that I have all these examples is that when we released the documentation for Deon, which has the installation instructions and things around usage and the checklist content, we also included a page of examples of times where things have gone wrong. And so the idea here was to really benefit from the power of example and to provide at least 1 example for every item on the checklist. And so you can see all these on the website, deon.drivendata.org, and you can dig into some of these stories I'm referencing. But to get back to the story on child welfare screening, it comes from a county in Pennsylvania which was using an algorithm to try to predict child abuse and neglect. And the variables that were used to predict child abuse are often direct measurements of poverty. They're things like whether the person has used food stamps or county medical assistance or supplemental security income.
And so the result of this is that you end up getting lower income families unfairly targeted for child welfare screening. To put it in other words, that algorithm essentially confuses parenting while poor with poor parenting. And so it views parents who reach out to public programs as risks to their children. And so you can see again how on the surface it may be hard to tell, because you're using variables that are good predictors. As Peter was mentioning, you're trying to predict abuse. And so even if you're doing a fairly good job of that, you have to be aware of what those variables are correlated with. What else might they be masking? How might this be having an unfair, discriminatory effect?
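One lightweight way to screen for the proxy problem Emily describes is to check, before training, how strongly each candidate feature is associated with a protected attribute that was deliberately excluded from the model. A minimal, hypothetical pandas sketch; the column names and values are invented for illustration:

```python
import pandas as pd

# Hypothetical applicant data: 'gender_female' is excluded from the model,
# but we still use it to audit the features that *are* included.
df = pd.DataFrame({
    "attended_womens_college": [1, 0, 1, 0, 1, 0, 1, 0],
    "years_experience":        [3, 5, 2, 7, 4, 6, 3, 5],
    "gender_female":           [1, 0, 1, 0, 1, 0, 1, 1],
})

protected = df["gender_female"]
for col in ["attended_womens_college", "years_experience"]:
    # A strong correlation flags a feature that may act as a proxy.
    print(col, round(df[col].corr(protected), 2))
```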
[00:29:30] Unknown:
And are there any particular cases or projects where you have leveraged Deon within your work at
[00:29:37] Unknown:
DrivenData? Yeah. So we started on this ethics checklist process, and building this package, in August. And 1 of the other things that we do at Driven Data is we maintain the cookiecutter data science project, which is basically just a project template for data science work. And we use that project template in all of the projects that we do. So because we created Deon as a Python package, it's very easy to script the inclusion of an ethics checklist. And so we now have it so that when you create a data science project, we automatically run Deon and include this checklist for all of our projects by default, without anyone taking that step. Anytime we say, hey, I wanna start a data science project, we now get that checklist embedded in the project. And so it becomes really core to all of the technical work that we do.
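The episode doesn't spell out the exact mechanism, but one plausible way to script the inclusion Peter mentions is a cookiecutter post-generation hook that shells out to deon. A hypothetical sketch, not the actual cookiecutter-data-science code:

```python
# hooks/post_gen_project.py
# Runs after cookiecutter renders a new project; drops in the checklist.
import subprocess

subprocess.run(["deon", "--output", "ETHICS.md"], check=True)
print("Added default data ethics checklist at ETHICS.md")
```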
[00:30:43] Unknown:
And given that the items on the checklist are largely built around the ethical concerns for the ways that the data is used, the source of the data, and the people that are involved in it, what are some of the customer facing impacts of having that checklist embedded in the project?
[00:31:03] Unknown:
Yeah. So, really, I think that the best customer facing benefit of including something like this is the avoidance of harms, and in particular, if you look through the examples, the kinds of harms that come up through data science projects. So, as we look through the set of examples for why an item is included on the checklist, we can see all kinds of things that go wrong that you might not expect to happen, or that might just happen if you're not paying attention or you're pushing toward a deadline, even if it's something that, as a team, you didn't intend to happen. So, really, the avoidance of those kinds of harms is the first order benefit to customers.
But I think there's also a bigger picture benefit just in terms of getting data scientists, who have a particular perspective on the ethical items that are on this checklist, more and more involved in that discussion and bringing it to the forefront of their minds when they think about their work. And what that means, I think, is that we have a broader base of understanding of the implications of data science and AI work across the profession, and that's going to have benefits going forward as we think about the interaction of AI, machine learning, and data science with society.
[00:32:39] Unknown:
And as you were compiling the list of items to go into the checklist, are there any that you considered and ultimately decided not to include? And what were some of the items in the list that were more contentious or that generated the most discussion as far as what to include?
[00:33:00] Unknown:
There was 1 item that we considered including but ultimately didn't, which was team composition. And so this is ultimately captured in the missing perspectives item, of thinking about whether we are engaging with stakeholders, recognizing that any team is going to have blind spots. For example, at Driven Data, we work on projects spanning the gamut from financial inclusion to energy to health, and so we obviously aren't experts in all of those areas. And so through our work in all of these areas, when we're not subject matter experts, we're really careful to engage with our partners to make sure that we have the correct definitions and the correct interpretation of the data, and that we're choosing metrics that capture the things that we want to be impacting. And so 1 of the reasons why we didn't include a bullet directly on team composition is that this is a broader issue in how we build data science teams and how we recruit them. And so we wanted the checklist to really be project specific.
And so in each project, we'd be thinking about that work versus the kind of broader essentially, the broader composition of the company.
[00:34:03] Unknown:
Yeah. And I think that part of this discussion was how broad do we go in terms of what goes in this checklist. For things that we felt were an organization's concerns rather than a data scientist's or data science team's concerns, we tried not to include them in the checklist. We really wanted these to be actionable and productive for working data scientists to engage with. And anything that felt like the responsibility of an organization at an organizational level isn't part of this data ethics checklist that we've created on a per project basis, but, obviously, should be something that these teams and organizations are thinking about and engaging in. We just can't solve everyone's problems at once with 1 checklist.
[00:34:59] Unknown:
And some of the items on the checklist directly coincide with various regulatory requirements. But are there any cases where you have seen regulation that conflicts with some of the items on the checklist and some of the ethical concerns that you would like to see promoted in data projects? I think whether it's directly regulated or not, there are certain industries, like health care, that are tending to favor explainable
[00:35:24] Unknown:
models over deep learning models, per se. And this is largely for ethical considerations. So to give you an example here, and this is one I thought was really interesting pertaining to the health care industry: there was a study done that was trying to predict the probability of death for pneumonia patients, the idea being that high risk patients could be admitted to the hospital and low risk patients could be sent home. So a deep learning model was trained, and as researchers were examining it, they found that the model predicted that asthmatics had a very low risk of dying and could therefore be sent home. And this is a very odd finding, because in practice, asthmatic patients with pneumonia are typically admitted directly to the ICU, the intensive care unit, because they have a very high risk of dying. What had happened was that, given the success of the intensive care unit, the model had incorrectly associated asthma with low risk.
And so it was actually thanks to an explainable model, a logistic regression I think it was in this case, that this issue was identified, thankfully before the model was ever put into practice. But you can see how, without that ability to understand why the model was making the predictions that it did, in this case, you know, it could have been recommending patients to be sent home, essentially, to die.
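To see why an explainable model surfaces this kind of problem, consider a toy version with synthetic data: because aggressive ICU care made asthmatic patients die less often in the training data, a fitted logistic regression ends up with a negative coefficient on the asthma feature, which a reviewer can spot and question. A black-box model encodes the same pattern but gives you no coefficient to inspect. This is an illustrative sketch, not the study's actual data or model:

```python
# Synthetic illustration of the pneumonia/asthma paradox.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 5000
asthma = rng.integers(0, 2, n)
age = rng.normal(60, 12, n)

# In the training data, asthmatics die *less* often because they were
# routed to the ICU and treated aggressively -- a confounded outcome.
logit = -5.0 + 0.07 * age - 1.2 * asthma
died = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([asthma, age])
model = LogisticRegression().fit(X, died)

# The negative asthma coefficient is the red flag a reviewer can catch.
print(dict(zip(["asthma", "age"], model.coef_[0].round(2))))
```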
[00:36:33] Unknown:
And beyond the project specific elements that are useful for technologists to consider as they work through a given project, what are some of the broader or more organization wide ethical principles that you would like to see practiced more widely, or that don't often get the attention that they deserve? I think that it's been pretty clear over the last couple years that
[00:36:59] Unknown:
the technology industry as a whole needs to be more engaged in the process of governance, more engaged with local, state, and federal governments, in asking, 1, how can we help solve problems that really matter? And then, 2, how do we go about building products that people want in a way that improves their lives and does not do harm to those citizens? And if you take a look at the Facebook hearings that happened after the Cambridge Analytica scandal, the level of discussion and nuance in those debates was relatively abstract.
So it was pretty general discussion about, oh, what's your business model? You sell ads. Are you selling people's data? And it didn't get to the point where you could have a discussion about how you can build a company and also take into account the kinds of concerns for its citizens that a government should have. So there needs to be a much broader push towards engagement between technology companies and governments, because that's the only way to stop this friction that we're having, and also to build a bridge and a common language for discussing issues. Because a lot of these technology companies are important economic drivers, and we should be able to both support them and protect citizens from some of these harms, intended or not.
[00:38:37] Unknown:
And what are some of your hopes for the future of the Deon project in terms of the outcomes that you'd like to see, or any future development or tooling that you would like to see done either on the project directly or in tandem with it? Well, my hope is that Deon becomes widely used,
[00:38:55] Unknown:
and that it helps data scientists engage directly in these ethical discussions. 1 of the exciting things about our work is that the tools that we build affect real people and they affect real lives. And because of that, I think it's particularly important for us to be proactive in working to ensure that our work has the positive impacts that we intend.
[00:39:20] Unknown:
And I'm gonna take things in a much more technical direction and say there are a couple of things that I think we could use some help with that maybe listeners could contribute on. 1 of them is that we'd like to support a rich text format, or at least a way where you can generate a checklist that goes to the clipboard but that you could then paste into a Word document or a Google document with reasonable formatting. And, actually, that problem is a stickier technical problem than we expected it to be. And so if anyone wants to dig into that, I think it's pretty interesting to look at that across operating systems and get support for text with formatting that you can copy and paste. So that's 1 very technical thing where some help would be awesome. And then the other 1 is we're hoping to build just a really small web application around this command line tool, so that if you're not a data scientist and you wanna grab 1 of these checklists to drop into a project, say you are a project manager but you wanna have this as part of your process, there's a place where you can easily go and generate a checklist.
And so just a small application that wraps the existing functionality is another place where we could see people contributing and helping out and helping to spread usage of the tool. So those are 2 very concrete things where we'd love to see some help. And then beyond that, we're very willing to engage in discussions around what's included, why things are included, what exactly they mean. And so, if people have questions about that or wanna have those discussions, I think that we're a very open and welcoming project to
[00:41:02] Unknown:
start those conversations, because we think those are conversations worth having. Yeah. I'd also add that, as I mentioned, we have this table of examples of times where things have gone wrong. And 1 of the things we're also thinking about compiling is a list of best practices, times where things have gone right. And that can be a bit tricky because, as we were talking about earlier, the checklist items can have different meanings in different contexts, and there's not necessarily a one size fits all solution. But I do think that it could be helpful to have some ways to point people in the right direction: these are things that worked in this context, or you might think about this set of things.
So I'd also encourage listeners, if they have recommendations on things that have worked and things that they're doing that seem to address some of these issues well, we'd love to hear about them. And do you have any thoughts on maybe creating some sort of a badge that open source projects can include in their documentation
[00:42:00] Unknown:
or that organizations or websites can put on to espouse their support for and possibly compliance with the various ethical principles that you're trying
[00:42:12] Unknown:
to help reinforce with the Deon project? We are hoping that you will open that issue on the repo so that we or some other contributor could get started on it. I think that's a great idea, to have 1 of those badges.
[00:42:25] Unknown:
And so, yeah, please. I think that would be awesome to be able to label projects as including an ethics checklist. Let's do it. And so for anybody who wants to follow the work that you're up to and find out more about what you're doing at Driven Data, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose an artist who I met recently at a craft fair up in Vermont. His name is Richard Bond, and he does these amazing pieces of work where he takes sheets of multicolored glass and then does very detailed etching with a sandblaster to reveal some of the different layers within the glass and give it depth and perspective. And his work is very beautiful and just incredibly detailed given the materials that he's working with, so I highly recommend checking him out. And so with that, I'll pass it to you, Emily. Do you have any picks this week? Yeah. That sounds super cool.
[00:43:20] Unknown:
My pick this week is that I was in Portland, Maine last week, and I had some of the best espresso that I've had in a long time at a coffee shop called Tandem Coffee. I love exploring coffee shops. And when I'm in a new place, I usually try to try a bunch of different coffee shops, and I ended up going to Tandem, like, 3 out of the 5 days that I was there. It was just so good. So 1 other thing to note on Tandem is they also have a variety of baked goods. If those are your thing,
[00:43:45] Unknown:
I'd highly recommend their ginger molasses cookie if you're there. Alright. And, Peter, do you have any picks this week? Yeah. Well, I'm gonna jump on the food train, I think, and give a West Coast option for people who are out on the West Coast, because, and this might sound hard to believe, I had a life changing English muffin the other day. And I know that sounds crazy, but this is real. There's a place called the Model Bakery, and they have locations in Saint Helena and Napa, California. And I was in there. It was a beautiful sunny fall day. It was a Friday afternoon.
I had just finished a big piece of work that I was doing, so I was in a great mood already. So I might have been primed for this English muffin to be life changing. But I went into the Model Bakery. I had a little bit of lunch, and then I said, ah, I want a little more, a little something after lunch. And I got their apple cinnamon English muffin, and I just ate it without anything on it, not toasted or anything, in the sunlight on a beautiful fall day after finishing some work, and it was absolutely incredible. So if you can get out there and try 1 of these English muffins, I promise that it will change your conception
[00:45:06] Unknown:
of what an English muffin is. Alright. Well, thank you both very much for taking the time today to discuss your work on Deon. It's definitely an interesting tool and supports some important conversations, so I appreciate that. And I hope you enjoy the rest of your day. Thanks. You too. Thanks for having us.
Introduction and Episode Overview
Meet the Guests: Emily Miller and Peter Bull
Emily Miller's Background and Introduction to Python
Peter Bull's Background and Introduction to Python
Introduction to Deon: An Ethics Checklist Tool
The Importance of Ethics in Data Science
Checklist Structure and Ethical Considerations
Integrating Deon into Data Science Workflows
Communication and Ethical Discussions Across Teams
Commonly Overlooked Ethical Concerns
Challenges in Addressing Ethical Concerns
Using Deon in DrivenData Projects
Customer-Facing Impacts of Ethical Checklists
Broader Organizational Ethical Principles
Future Development and Hopes for Deon
Best Practices and Community Contributions
Closing Remarks and Picks