Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__)

Hello, and welcome to the Data Engineering podcast, the show about modern data management.

When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.

And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site.

This is your host, Tobias Macey. And this week, I'm sharing an episode from my other show, Podcast.__init__, about a project from DrivenData called Deon.

It is a simple tool that generates a checklist of ethical considerations for the various stages of the life cycle for data oriented projects.

This is an important topic for all of the teams involved in the management and creation of projects that leverage data, so give it a listen. And if you like what you hear, be sure to check out the other episodes at pythonpodcast.com.

Your host as usual is Tobias Macey, and today, I'm interviewing Emily Miller and Peter Bull about Deon, an ethics checklist for data scientists. So, Emily, could you start by introducing yourself?

Sure. I'm Emily Miller, and I'm a data scientist at DrivenData. My background's in international development. And so before becoming a data scientist, I worked in policy research at the Brookings Institution and at Stanford University. And I'm really excited to be in the data for good space because I feel like there's so much potential for using data science to help tackle really big societal problems.

And Peter, how about yourself?

Hi. I'm Peter Bull. I'm one of the cofounders of DrivenData. I also lead a lot of our technical work. My background originally is in software engineering. I actually was a philosophy major as an undergrad, then went and worked at Microsoft for a number of years. And after that, I got more interested in data questions and went back to grad school to do a computational science degree. And while I was in grad school, I was looking to apply this toolkit to social impact problems. I didn't find many good outlets for doing that. And so a couple of my friends from grad school and I started DrivenData.

And Emily, going back to you, do you remember how you first got introduced to Python?

Yeah. So as I mentioned, I used to work as a research assistant, and I was doing econometric regressions mostly in Stata, which is the proprietary software preferred by economics professors.

And over time, I decided I didn't wanna stay in research. I didn't wanna get a PhD, but I really loved working with data, and that became kind of the impetus for transitioning into data science. And so based on my Internet research, Python was clearly the language of choice, the language to learn. And so I started teaching myself Python, and then eventually went to a data science boot camp called Metis. And I just remember very distinctly first working with Pandas and realizing how easy and straightforward it was to do data manipulation and just how much easier it was to work with data than in all of the proprietary software.

So it was a pleasant surprise to get to work in Python and have that be part of my daily life now.

And R is one of the other big languages, at least in terms of open source for data science. So have you crossed paths with that at all since you started working in the field?

I worked with R a little bit back in undergrad, actually, for my undergrad thesis, mostly because my thesis adviser recommended working with R. Since working in Python, I haven't really had a need to then switch over to R, so I've just been pretty much exclusively working with Python.

And, Peter, do you remember how you first got introduced to Python?

Yeah. So, I was working at Microsoft and using the .NET stack and mostly working in C# and C++. I worked on two teams there: first the Microsoft Office user experience team, and then Microsoft SharePoint. And I was actually looking for a language or a technology to learn to do side projects in. I had a couple ideas for small web apps that I wanted to try building, and I wanted an easy path for building and then deploying those applications that I wanted to do on the side. And it just seemed like Python was a really great fit for that kind of work. So I started building little side projects in Python, and that's really how I got introduced to it. And since then, it has just really grown on me as a language that we can do both our data work in and a lot of the software engineering work that we do.

And so you both work now at DrivenData. And as part of the work that you've been doing there, you built this tool called Deon. So can you give a bit of a description about what it is and your motivation for creating it?

Yes. Deon is a command line tool that generates ethics checklists for data science teams. So what it does is give you one simple command that you run as a data scientist, integrated into your workflow, that will add a set of items that really are questions to think about when doing data science work. And we've mapped ethical concerns to different steps in the data science process. And the idea is that as a data scientist, it's hard to pull myself out of my day to day work to ask these questions that I'd like to be asking as part of my process. And Deon is designed to make it easy to say, okay, I'm working in the Python stack. I'm working at the command line. I wanna make sure we have these ethics discussions. How do I get that started in my project? And so by running a command at the command line, we can generate a Markdown file that has these ethical checklist items. And then I can open a pull request on GitHub and say, okay, here's our ethics checklist, team. Let's make sure that as we go through the course of our project, we're discussing these items and we're all on the same page about the decisions that we make in how we think about our data.

And in terms of the checklist, after it's generated, are there any other utilities built into that command line tool to be able to then link back maybe any code comments or code attributes to different items within the checklist, or is it mostly just to be able to form a focal point for ongoing discussion of those concerns within the project?

So we've structured it to have reference numbers for each of the sections so that you could easily say, oh, I'm changing this part of how we store the data here in response to checklist item whatever. So it's relatively easy to refer to each of the items by either a short name or by a numbering scheme. There's no technical tie-in between the work that you would do and the checklist items. It's really designed to be relatively lightweight in terms of how much process it forces onto its users. But the idea is that by having this checklist as part of the technical workflow, the technical team is encouraged to have these ethical discussions.
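To make the idea of a Markdown checklist with a reference number and short name per item more concrete, here is a minimal Python sketch. It is not Deon's implementation or API: the `ChecklistItem` class, the `render_markdown` function, the reference IDs, and the item wording are all hypothetical, chosen only to mirror the structure described in the conversation.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    ref: str         # reference ID such as "A.1" (hypothetical numbering)
    short_name: str  # short name used to refer to the item in discussion
    question: str    # the discussion prompt itself


# Hypothetical items for illustration only; not the wording of the real checklist.
CHECKLIST = {
    "A. Data Collection": [
        ChecklistItem(
            "A.1",
            "Informed consent",
            "Have we considered whether people consented to this use of their data?",
        ),
    ],
    "B. Data Storage": [
        ChecklistItem(
            "B.1",
            "Data security",
            "Have we sought to address how the data is protected and who can access it?",
        ),
    ],
}


def render_markdown(checklist, title="Data Science Ethics Checklist"):
    """Render sections and items as a Markdown task list."""
    lines = [f"# {title}", ""]
    for section, items in checklist.items():
        lines.append(f"## {section}")
        for item in items:
            lines.append(f" - [ ] **{item.ref} {item.short_name}**: {item.question}")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_markdown(CHECKLIST))
```

With a file like this checked into the repository, a commit message or pull request comment can cite an item by its ID or short name (for example, "changes how we store the raw data, in response to B.1"), which is the lightweight kind of link between code and checklist described here.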

And what are the benefits of having it be formatted as a checklist and be loosely enforced based on the norms and policies of the organization, as opposed to being something that's more deeply integrated into the code base and the project life cycle, or something more general and policy oriented, something along the lines of the Hippocratic Oath or something that the different members of the team are sworn to as part of their employment?

So I think I'll take that from two angles. The first question is why is this different from an oath? And the idea here is that the data science community was having a relatively robust discussion around ethics. So particularly after the Equifax breach and then the Cambridge Analytica scandal, lots of people were talking about how do we make sure that these technologies are used in ways that are ethical. And that led to some calls along the lines of, okay, we need an oath for data scientists where they can say they're committed to only doing things that are good or are right. And our perspective on that is that that is a good first step. But even if you have those intentions, it can be relatively hard in the moment to do the right thing because it's not necessarily always your top of mind concern. You've got deadlines. You've got PRs to review. You have project managers that are asking you questions. And so we wanted something that was really actionable and something that could be integrated into the process. So it's not just a one-time commitment to doing the right thing, but it lets us feel like we have a backstop that makes sure we have these discussions even in the face of other pressures that we have during the course of a project.

Yeah. One of the things that checklists are really great at doing is translating principle into practice. And so oaths, as Peter was mentioning, can be a one-time thing. They can be general and vague, such as "do no harm." And so the thing that we really like about checklists is that they can be more specific, and because they're used repeatedly, they can be more targeted toward execution and action. And they can really help us put these broader aspirations into context in a way that we can integrate them into our workflows on a more daily basis.

That's one angle on why the checklist is different from an oath, but I also wanted to distinguish it from a commitment to ethical action that a company may have, where they have some processes or they have a corporate social responsibility group, or they have a team thinking about how their company can have an ethical impact, because there are lots of organizations that do that and end up doing relatively good work in the world. But the reason we wanted something like this checklist is that for a number of these data questions, data scientists actually have a particular perspective that's valuable. Often the question that is being asked is not what's right, but the question that's being asked is, are we doing what we agree is right with this data in a way that we can agree on. And often, it's the case that with data science work, those are relatively tough questions to answer. And you need to do a lot of thinking and discussion to make sure you're even living up to your own goals.

And I think it's worth noting that a lot of the items on the checklist are intended to provoke discussion. You'll notice that a lot of the items start with phrases like "have we sought to address" and "have we considered," because we're working at a level of abstraction here where we can't recommend, you know, remove variable x from model y. And so the idea is that ethics is hard, and it's filled with nuance and trade-offs, and so there's not necessarily a clear right answer. And so the goal here is really to spur those discussions and think about what are the trade-offs and what are our intentions here? And can we be more explicit in why we've made the decisions that we've made, taking into account those ethical considerations.

And the checklist itself is also fairly granular and broken out into different subcategories that encapsulate the various life cycle stages of a data project. Whereas with an oath that's more general, it can be difficult to directly tie a specific action to whether it is in line with your goals. And you're not necessarily going to go straight from everything being completely in line with what you're trying to achieve to then being at odds with it. It's more likely to progress in various small steps and gradations so that you don't necessarily recognize in the moment that you have drifted from your intended goals and your intended mores for how the project is to be used. And so after a while, you then come to the realization that through those small steps, you inadvertently have gotten to a place that you're not comfortable with, and then it's difficult to back out of it because of all of the weight of legacy that has been built into the system and the models that have been generated from it.

Yeah. And one of the reasons why we break the checklist up into the five different sections around collection, storage, analysis, modeling, and deployment is that there are particular ethical concerns that come up at those different points. And so things like informed consent are really salient in the data collection process, whereas looking at bias that's in the data set comes up both during collection and during the analysis. And so having that separation, I think, allows us to better tackle the full range of concerns rather than trying to do everything up front, because it's not possible. All of those issues aren't necessarily coming up ex ante. They're coming up as you go through the data, as you're looking at things, as you're getting to deployment. You have different concerns than you had when you were first collecting the data.

And as is noted in the documentation for the Deon project, computer science and software engineering and data science are among the few fields that don't have an established

professional

ethics board or anything along those lines. And it's more left up to the individual contributors

and individuals

within an organization

to try and adhere to what they view as being ethical.

And I've had a discussion previously about some of the concerns of ethics in the software engineering context. But what is unique to data science and data projects in terms of those ethical concerns that

differentiates it and

requires

a specific approach for those types of projects as opposed to just traditional software engineering?

Yeah. I think that's a really good question.

And I started to think about this having had a background

as a software

engineer

and really not feeling like the same sort of questions came up for me in the course of my work as a software engineer.

So in my mind, there are two things that make these concerns particularly relevant to data scientists. If you look at the checklist, there are a number of concerns, particularly the ones about data storage and deployment, that are very relevant to software engineers as well. But there are two differences. The first is that data just is a reflection of the real world in a way that not every software product is. So when you're building software, if you're starting from scratch, you're often at a place where you get to make all of the design decisions. You get to lay out what your perfect world looks like. Whereas when you're working with data, you've got this reflection of the real world as it exists already, and that's gonna come with biases built into that data that you need to make an effort to understand if you're going to use it effectively. So that's one way in which data science ends up being different than software engineering.

And the other way is that

machine learning

is not designed to ask why questions.

So

machine learning is meant to

extract patterns from the data that let us predict

thing x or thing y.

And

because we're not saying that the machine has to tell us why it's making these decisions,

there's inherently more risk in that process.

Whereas with software engineering, we're explicitly telling the machine

what to do rather than asking it to learn what to do. And when we ask the machine to learn what to do, it may be doing it for reasons we don't agree with, and we need to be able to inspect that. We need to be able to audit that and ask questions about, is that actually

how we want to reflect this decision making process?

And for people who are deciding to use Deon within their projects, what is the typical workflow for introducing it and ensuring that it continues to be part of the development cycle as the project progresses?

So,

really, we think of this as a tool for data scientists

to get data scientists involved in this discussion.

And so the idea is that we want it to be very easy for technical teams to use this. This isn't a product that's designed for project managers

or for legal departments.

But, really, we think data scientists

want to be part of the discussion around ethics and have a particular perspective.

And so the idea of the workflow is that if I'm working on a team, I can actually run a command at the command line, and that will generate a Markdown file that has all these checklist items. And I can immediately submit a PR with that and say, hey, team, I'm adding our ethics checklist to the project. Let's make sure that during the course of the project, we look at these items at the relevant stages. And that's one way to start integrating it into your process. But even if you're doing a lighter weight sort of analysis where you're just working in a Jupyter Notebook, you can run a command and the checklist will get appended to the end of your Jupyter Notebook. And you can delete the items that might not be relevant for this particular analysis and just say, hey, I thought about things x, y, and z that are important for the work that we did here, and here's my note that I did think about these. And we can continue to have this discussion if we need to.
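As a rough illustration of what "appending the checklist to the end of a notebook" amounts to, the sketch below adds one Markdown cell to a notebook represented as plain JSON. This is not how Deon itself does it, as far as this transcript says; the function name `append_markdown_cell`, the checklist text, and the item IDs are hypothetical. The only thing relied on is the notebook file layout (a top-level `cells` list whose entries carry `cell_type`, `metadata`, and `source`).

```python
import json


def append_markdown_cell(notebook: dict, markdown: str) -> dict:
    """Return a copy of a notebook dict (nbformat 4 JSON) with one extra Markdown cell."""
    cell = {
        "cell_type": "markdown",
        "metadata": {},
        # Notebook files store cell source as a list of lines.
        "source": markdown.splitlines(keepends=True),
    }
    updated = dict(notebook)
    updated["cells"] = list(notebook.get("cells", [])) + [cell]
    return updated


# A minimal, empty notebook in the nbformat 4 layout.
EMPTY_NOTEBOOK = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 4}

# Hypothetical checklist text: one item addressed, one marked N/A.
CHECKLIST_MD = (
    "## Ethics checklist\n"
    " - [x] C.1 Dataset bias: discussed, see notes above\n"
    " - [ ] A.1 Informed consent: N/A, no data collection in this analysis\n"
)

if __name__ == "__main__":
    nb = append_markdown_cell(EMPTY_NOTEBOOK, CHECKLIST_MD)
    print(json.dumps(nb, indent=1))
```

Because the result is an ordinary Markdown cell, the analyst can edit it in place: delete items that do not apply, tick the ones that were discussed, and leave a short note next to each.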

Yeah. I think it's worth noting that N/A is a totally acceptable answer. There are projects that may not involve data collection. And so it is totally okay to not have a section be relevant or not have an answer to some of the items on the checklist. Not every item is gonna be relevant to every project.

And so there is a default checklist that comes with the project. And given that it's just a Markdown document that gets generated, there's room for editing or changing the checklist or adding various additional concerns. So have you seen any cases where people who are using Deon have added or modified those checklist items, or any items that you either have received or anticipate pushback on from people who are getting started with Deon?

Yeah. So we don't have a good way of tracking if people have changed checklist items. So I don't have any specific examples of that yet, but we just launched the tool about a month ago. So, hopefully, we'll get reports back of usage and the kinds of things that people are adding. But you can see a lot of domain specific changes that people may wanna make. So an easy example is if you work in health care, you may wanna have an item on your checklist that ensures that your data is stored on HIPAA compliant servers. Or if you work in an academic context, you may wanna have an item that documents how your study went through the institutional review board and got approved. So there are domain specific changes that teams will wanna make depending on their context. And it's relatively easy to do that within the tool and come up with your own checklist file that you can then use in all of your projects in your context for generating this ethics checklist.

And given that some of the elements in the checklist cover items such as data collection and data storage and the deployment of the resultant models, there's the potential for some of these concerns to span multiple teams within an organization. How does the checklist help with the communication around these ethical items across those team boundaries and across the broader organization?

So in my mind, a part of this is about just having a common language to talk about ethical concerns. And because we've framed each one of the items with a particular description and a way to refer to it, but also given examples of where something has gone wrong with that particular checklist item, it makes it easier to have this common language where you can say, oh, we don't want to fall into a scenario where this kind of thing happens to us. How do we avoid that? So it's partly about just giving a team a language and a context for discussing these concerns. And then beyond that, I think that having a formalized process like this, even if it's a relatively informal formalized process, gives you a reason to bring up topics that can be hard to bring up otherwise. A lot of these discussions are going to be hard discussions to have as a team. There's real work to do in a lot of these checklist items. But without a reason to bring up those concerns, you may not do it because you know it's gonna be a tough discussion. And so just having a part of your process where those discussions are built in, where you know you're expecting them, makes it that much easier to make sure that it happens. And it's not on any one person's shoulders to say, okay, we gotta have the ethics discussion now, and no one wants to have this discussion with me, but I have to make sure it's happened. It's my responsibility. And it helps to spread that concern out throughout the team and make it everyone's responsibility and make everyone aware of what that discussion is going to look like.

And in your experience working on data science projects or interacting with other people who are in a similar space, what are some of the items on the checklist that you have seen to be most commonly overlooked or possibly ignored?

I think one of the most important items on the checklist for me is around dataset bias. And I think it's important to remember that no matter how much effort went into eliminating bias in the data collection or in the preprocessing, there will always be bias. We're not going to have a dataset that is perfectly unbiased. And so the question to ask is not is my dataset biased, but how is my dataset biased? Because examining the possible sources of bias is the first step for figuring out what we can do about them. And so what do we mean when we say bias? We're talking about things like stereotype perpetuation and confirmation bias, confounding variables that may be omitted, or even having imbalanced classes.

So an example of this comes from Word2Vec, and a lot of data scientists will be familiar with this. It's a vector space dataset that captures the relationship between words, and the corpus that Google trained the neural network on was essentially all of Google News articles. And so the thing that on the surface people may not realize is that despite this model being trained on a huge dataset, and you're looking at this vector space and it's math and you're like, doesn't that mean that it's right? Right? Doesn't that mean it's objective? It's been trained on this huge dataset. And we have to remember that the articles that the model was trained on exist in the context of society. Right? And so they're written by authors who have their own biases. And so these biases get embedded in those word embeddings with Word2Vec, for example. When you start to explore some of these representations, as researchers have done, you come up with things like man is to computer programmer as woman is to homemaker. Right? And so we have clear evidence of these stereotype perpetuations, things like sexism, that are showing up in this dataset that on the surface may appear to be objective.

The other thing that I'd mention here is around data collection, and I feel like there's potentially more awareness around bias in data collection. This shows up a lot in statistics, in thinking about how to have a representative sample. You're trying to estimate a trait about a larger population from the sample. But I think that identifying bias in data collection can still be really tricky, especially as we start to work more with passively collected data where we're not going out and surveying people. We're able to use data from smartphones, for example.

So to make this more concrete, there is an app called Street Bump, and it tried to passively collect data from smartphones based on the accelerometer and the GPS data to detect potholes. And then it would transmit this data to the city so that they could make more data driven decisions about how to direct public resources and help improve the roads. And so on the surface, this seems like a great idea. Right? We're using passively collected data, so we're not going out and bothering people and asking questions, and we're allowing a more data driven approach to using public resources. But we have to remember that smartphones aren't evenly distributed across the population. Right? We know that lower income populations and elderly populations are less likely to have cell phones or less likely to have smartphones. And so the effect of an app like this can actually be a diversion of public resources to areas that are younger and richer, and it may not actually have the equitable use that was intended.
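The arithmetic behind this kind of collection bias can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the neighborhood names, the pothole counts, the smartphone ownership rates, and the simple model in which the expected number of reports scales with the share of residents who carry a smartphone. It is not a description of how Street Bump or any city actually allocated resources.

```python
def expected_reports(potholes: int, smartphone_rate: float, report_prob: float = 0.5) -> float:
    """Expected reported potholes when only smartphone owners can generate reports."""
    return potholes * smartphone_rate * report_prob


# Made-up numbers: both neighborhoods have the same number of actual potholes,
# but different shares of residents with a smartphone.
NEIGHBORHOODS = {
    "younger_richer": (100, 0.8),
    "older_poorer": (100, 0.4),
}


def allocate_budget(budget: float, neighborhoods: dict) -> dict:
    """Split a repair budget in proportion to expected reports per neighborhood."""
    reports = {
        name: expected_reports(potholes, rate)
        for name, (potholes, rate) in neighborhoods.items()
    }
    total = sum(reports.values())
    return {name: budget * count / total for name, count in reports.items()}


if __name__ == "__main__":
    for name, share in allocate_budget(90_000, NEIGHBORHOODS).items():
        print(f"{name}: {share:.0f}")
```

Under these assumptions the neighborhood with twice the smartphone ownership receives twice the budget even though the road conditions are identical, which is the diversion of resources described above.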

And some of those items are fairly difficult, particularly in terms of data collection and trying to ensure that you either account for or eliminate bias in that collection, and also in terms of identifying the biases within the data set. And what are some of the other concerns within the checklist that are particularly difficult to comply with for a given data project?

Yeah. I think one of the more difficult ethical concerns to address is proxy discrimination. And so you may not have variables in the model specifically on gender or race or age or religion, but you may have other variables that are highly correlated with those and therefore can act as proxies or be otherwise unfairly discriminatory. To give you an example here, there was a recent article just from a few days ago on how Amazon was trying to build a recruiting engine where you could feed in, say, 100 resumes and it would spit out the top 5 that you should go interview. And it turned out the tool was biased against women. And it's not that there is an explicit gender field of male or female passed into the model, but the model had picked up on things that only women did, like being the captain of the women's chess club or going to an all women's college, or even using certain terms that are more associated with females than males. And so you can see how on the surface, you say, okay, well, I don't have this variable. That bias must not exist. But it finds many other ways to sneak in. To give you a second example here, this one is around child welfare scrutiny.

One of the reasons, I should say, that I have all these examples is that one of the things that we did when we released the documentation for Deon, which has the installation instructions and things around use and the checklist content, is we also included a page of examples of times where things have gone wrong. And so the idea here was to really benefit from the power of example and to provide at least one example for every item on the checklist. And so you can see all these on the website, deon.drivendata.org, and you can dig into some of these stories I'm referencing. But to get back to the story on child welfare scrutiny, it comes from a county in Pennsylvania which was using an algorithm to try to predict child abuse and neglect. And the variables that were used to predict child abuse are often direct measurements of poverty. They're things like whether the person has used food stamps or county medical assistance or supplemental security income. And so the result of this is that you end up getting lower income families unfairly targeted for child welfare screening. To put it in other words, that algorithm essentially confuses parenting while poor with poor parenting. And so it views parents who reach out to public programs as risks to their children. And so you can see again how on the surface it may be hard to tell because you're using variables that are good predictors. As Peter was mentioning, you're trying to predict abuse. And so even if you're doing a fairly good job of that, you have to be aware of what are those variables correlated with? What else might they be masking? How might this be having an unfair, discriminatory effect?
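One simple first check for the "what are those variables correlated with?" question is to measure how strongly each model input tracks a sensitive attribute that was deliberately left out of the model. The Python sketch below does this with a plain Pearson correlation. The data, the feature names, the `flag_proxies` helper, and the 0.7 threshold are all hypothetical, and a low correlation does not rule out a proxy, since combinations of variables or nonlinear relationships can still encode the attribute.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def flag_proxies(features, sensitive, threshold=0.7):
    """Return features whose absolute correlation with the sensitive attribute is at least threshold."""
    flagged = {}
    for name, values in features.items():
        r = pearson(values, sensitive)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged


# Toy, made-up data: 1 means the family is low income. This attribute is
# not a model input, but the first feature below closely tracks it.
LOW_INCOME = [1, 1, 1, 0, 0, 0, 1, 0]
FEATURES = {
    "used_food_stamps": [1, 1, 0, 0, 0, 0, 1, 0],
    "num_children": [2, 1, 3, 2, 1, 3, 2, 2],
}

if __name__ == "__main__":
    for name, r in flag_proxies(FEATURES, LOW_INCOME).items():
        print(f"{name}: correlation {r:.2f} with low income")
```

On this toy data only `used_food_stamps` is flagged. A flag is a prompt for the kind of team discussion the checklist asks for, not an automatic verdict that the variable must be dropped.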

And are there any particular cases or projects where you have leveraged Deon within your work at DrivenData?

Yeah. So we started on this ethics checklist process and building this package in August. And one of the other things that we do at DrivenData is we maintain the Cookiecutter Data Science project, which is basically just a project template for data science work. And we use that project template in all of the projects that we do. So because we created Deon as a Python package, it's very easy to script the inclusion of an ethics checklist. And so we now have it so that when you create a data science project, we can automatically run Deon and include this checklist for all of our projects by default without anyone taking that step. Anytime we say, hey, I wanna start a data science project, we now get that checklist embedded in the project. And so it becomes really core to all of the technical work that we do.

And the items in the checklist are largely built around the ethical concerns for the ways that the data is used and the source of the data and the people that are involved in it. What are some of the customer facing impacts of having that checklist embedded in the project?

Yeah. So, really, I think that the best customer facing benefit of including something like this is the avoidance of harms, and in particular, if you look through the examples, the kinds of harms that come up through data science projects. As we look through the set of examples for why an item is included on the checklist, we can see all kinds of things that go wrong that you might not expect to happen, or that might just happen if you're not paying attention or you're pressed by a deadline, even if it's something that, as a team, you didn't intend to happen. So, really, the avoidance of those kinds of harms is the first order benefit to customers. But I think there's also a bigger picture benefit just in terms of getting data scientists, who have a particular perspective on the ethical items that are on this checklist, more and more involved in that discussion and bringing it to the forefront of their minds when they think about their work. And what that means, I think, is that we have a broader base of understanding of the implications of data science and AI work across the profession, and that's going to have benefits going forward as we think about the interaction of AI, machine learning, and data science with society.

And as you were

compiling the list of items to go into the checklist, are there any that you considered and ultimately decided not to include?

And what were some of the

either more contentious

or what were some of the items in the list that generated the most discussion as far as what to include?

There was 1 item that we considered including but ultimately didn't, which was thinking about team composition.

And so this is ultimately captured

in missing perspectives of thinking about, are we engaging with stakeholders

recognizing that any team is going to have blind spots? For example, it driven data. We work on projects spanning the gamut from financial inclusion to energy to health. And so

we obviously aren't experts in those areas. And so through our work in all of these areas, when we're not subject matter experts,

we're really careful to engage with our partners to make sure that we have the correct definitions and the correct interpretation of the data and that we're choosing metrics that capture the things that we want to be impacting. And so 1 of the reasons why we didn't include a bullet directly on team composition is that this is a broader

issue in how we build data science teams and how we recruit them. And so we wanted the checklist to really be project specific.

And so in each project, we'd be thinking about

that work rather than, essentially, the broader composition of the company.

Yeah. And I think that part of this discussion

was

how broad do we go in terms of what goes in this checklist. And

for things that

we felt were an organization's

concerns rather than a data scientist's or data science

team's concerns,

we tried not to include things that should be addressed at that level

in the checklist. We really wanted these to be actionable

and productive

for working data scientists

to engage with.

And anything that felt like the responsibility

of an organization at an organizational

level

isn't part of this data ethics checklist that we've created

for a project basis, but, obviously, should be something that these teams and organizations

are thinking about and engaging in. We just can't solve everyone's problems at once with one checklist.

And some of the items on the checklist directly coincide

with various regulatory requirements.

But are there any cases where

you have seen regulation

that conflicts with some of the items on the checklist and some of the ethical concerns that you would like to see promoted in data projects?

I think whether it's directly regulated or not, there are certain industries like health care that are tending to favor explainable

models over deep learning models per se. And this is largely for ethical considerations.

So to give you an example here, one that I thought was really interesting pertains to the health care industry. There was a study that was trying to predict the probability of death for pneumonia patients. The idea was that high risk patients could be admitted to the hospital and low risk patients could be sent home. So a deep learning model was trained, and as researchers were examining it, they found that the model predicted that asthmatics had a very low risk of dying and could therefore be sent home. And this

is a very odd finding because in practice, asthmatic patients with pneumonia are typically admitted directly to the ICU, the intensive care unit, because they have a very high risk of dying. What had happened was that, given the success of the intensive care unit, the model had incorrectly associated

asthma with low risk.
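
To make the mechanism concrete, here is a minimal Python sketch with invented counts (they are not taken from the study described here). It assumes a logistic regression with an intercept and a single binary asthma feature, in which case the fitted coefficient equals the log odds ratio between the two groups, so no fitting library is needed. The names `records`, `odds`, and `asthma_coefficient` are hypothetical.

```python
import math

# Hypothetical counts, invented for illustration (not from the study).
# Asthmatics were routed to the ICU, so fewer of them died in the records.
records = {
    "asthma": {"patients": 200, "deaths": 10},
    "no_asthma": {"patients": 1000, "deaths": 110},
}


def odds(group):
    """Odds of death (deaths divided by survivors) within one group."""
    deaths = records[group]["deaths"]
    survivors = records[group]["patients"] - deaths
    return deaths / survivors


# With an intercept and one binary feature, the maximum likelihood
# logistic regression coefficient is the log odds ratio of the groups.
asthma_coefficient = math.log(odds("asthma") / odds("no_asthma"))

print(f"asthma coefficient: {asthma_coefficient:.2f}")
```

The coefficient comes out negative, which is how an interpretable model puts the misleading "asthma lowers risk" pattern in plain view, where someone who knows about the ICU policy can question it.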

And so it was actually thanks to an explanatory model, a logistic regression, I think it was in this case, that this issue was identified,

thankfully before the model was ever put into practice. But you can see how, without that ability to understand why the model was making the predictions that it did, in this case, you know, it could have been recommending patients to be sent home essentially to die.

And beyond

the project specific elements

that are useful to be considered as

technologists

are working through a given project, what are some of the broader or more organization wide

ethical principles

that you would like to see

practiced more widely or that don't often get the attention that they deserve? I think that it's been pretty clear over the last couple years that

the technology

industry

as a whole needs to be more engaged in the process

of governance,

more engaged with both their local, state, and federal governments in asking,

one,

how can we help solve problems that really matter?

And then, two,

how do we go about building products that people want in a way that

improves their lives

and does not cause

harm to those citizens? And if you take a look at the Facebook

hearings that happened after the Cambridge Analytica scandal,

the

level of discussion

and

nuance

that happened during those debates was

relatively

abstract.

So it was pretty general discussion about,

oh, what's your business model?

You sell ads. Are you selling people's data? And it didn't get to the point where you could have a discussion about

how you can build a company

and also take into account

the kinds of concerns for its citizens that a government should have.

So there needs to be a much broader push towards engagement

between technology companies

and governments

because that's the only way to stop this friction that we're having and also to build a bridge and a common language for discussing issues.

Because a lot of these technology companies are important economic drivers,

and we should be able to both support them,

and protect citizens from some of these harms, intended or not.

And what are some of your hopes for the future of the Deon project in terms of the outcomes that you'd like to see or any future development or tooling that you would like to see done either on the project directly or in tandem with it?

Well, my hope is that Deon becomes widely used,

and that it helps data scientists engage directly.

One of the exciting things about our work is that the tools that we build affect real people and they affect real lives. And because of that, I think it's particularly

important for us to be proactive

in working to ensure that our work has the positive impacts that we intend.

And I'm gonna take things in a much more technical

direction

and say there are a couple of things that I think we could use some help with that maybe listeners

could contribute

on. One of them is,

we'd like to support

a rich text format

or at least a way where

you can generate a checklist that goes to the clipboard, but then you could paste into a Word document or a Google document that has reasonable formatting.

And, actually, that problem is a stickier technical problem than we expected it to be. And so if anyone wants to dig into that, I think it's pretty interesting to look at that across operating systems

and get support for text with formatting that you can copy and paste. So that's one very technical thing where some help would be awesome. And then the other one is we're helping to build just a really small web application

around this command line tool so that if you're not a data scientist

and you wanna grab one of these checklists to drop into a project, say, you are a project manager

and you wanna have this as part of your process, there's a place where you can easily go and generate a checklist.

And so just a small application that wraps the existing functionality

is another place where we could see people contributing

and helping out and helping to spread usage of the tool. So those are two very concrete things where we'd love to see some help. And then beyond that, we're very willing to engage in discussions

around

what's included,

why things are included,

what exactly they mean. And so,

if people have questions about that or wanna have those discussions, I think that we're a very open and welcoming project to

start those conversations because we think those are conversations worth having.

Yeah. I'd also add that, as I mentioned, we have this table of examples of times where things have gone wrong. And one of the things we're also thinking about compiling is a list of best practices, of times where things have gone right. And that can be a bit tricky because,

as we were talking about earlier, the checklist items can have different meanings in different contexts, and there's not necessarily a one size fits all solution. But I do think that it could be helpful to have some ways to point people in the right direction:

these are things that worked in this context, or you might think about this set of things.

So I'd also encourage listeners, if they have recommendations on things that have worked and things that they're doing that seem to address some of these issues well, we'd love to hear this.

And do you have any thoughts on maybe creating some sort of a badge that open source projects can include in their documentation

or that organizations

or websites can put on to

espouse their

support for

and possibly compliance with the various ethical principles that you're trying

to help reinforce with the Deon project?

We are hoping that you will open that issue on the repo so that we or some other contributor could get started on it. I think that's a great idea to have one of those badges.

And so, yeah, please. I think that would be awesome to be able to label projects as including an ethics checklist. Let's do it.

And so for anybody who wants to follow the work that you're up to and find out more about what you're doing at DrivenData, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose

an artist who I met recently at a craft fair up in Vermont.

His name is Richard Bond,

and he does these amazing pieces of work where he takes sheets of multicolored glass and then does very detailed etching with a sandblaster

to reveal some of the different layers within the glass and give it depth and perspective. And,

his work is very beautiful and just incredibly detailed given the materials that he's working with, so I highly recommend checking him out. And so with that, I'll pass it to you, Emily. Do you have any picks this week? Yeah. That sounds super cool.

My pick this week is that I was in Portland, Maine last week, and I had some of the best espresso that I've had in a long time at a coffee shop called Tandem Coffee.

I love exploring coffee shops. And when I'm in a new place, I usually try to

try a bunch of different coffee shops, and I ended up going to Tandem, like, 3 out of the 5 days that I was there. It was just so good. So one other thing to note on Tandem is they also have a variety of baked goods. If those are your thing,

I'd highly recommend their ginger molasses cookie if you're there.

Alright. And, Peter, do you have any picks this week?

Yeah. Well, I'm gonna jump on the food train, I think, and give a West Coast option

for people who

are out on the West Coast because

I had and

this might sound hard to believe, but I had a life changing

English muffin

the other day.

And I know that sounds crazy, but this is real. There's a place called the Model Bakery,

and they have locations

in Saint Helena and Napa, California.

And I was in there. It was a beautiful

sunny fall day. It was a Friday afternoon.

I had just finished

a big piece of work that I was doing, so I was in a great mood already. So I might have been primed for this English muffin to be life changing. But I went into the Model Bakery.

I had a little bit of lunch, and then I said, ah, I want a little more, a little something after lunch. And I got their apple cinnamon English muffin,

and I just ate it without anything on it, not toasted or anything,

in the sunlight on a beautiful fall day after finishing some work, and it was absolutely incredible. So if you could get up and try one of these English muffins,

I promise that it will change your conception

of what an English muffin is.

Alright. Well, thank you both very much for taking the time today to discuss your work on Deon. It's definitely an interesting

tool and supports some important conversations, so I appreciate that. And I hope you enjoy the rest of your day. Thanks. You too. Thanks for having us.
