Designing For Data Protection - Episode 106


The practice of data management is one that requires technical acumen, but there are also many policy and regulatory issues that inform and influence the design of our systems. With the introduction of legal frameworks such as the EU GDPR and California’s CCPA, it is necessary to consider how to implement data protection and data privacy principles in the technical and policy controls that govern our data platforms. In this episode Karen Heaton and Mark Sherwood-Edwards share their experience and expertise in helping organizations achieve compliance. Even if you aren’t subject to specific rules regarding data protection, it is definitely worth listening to get an overview of what you should be thinking about while building and running data pipelines.


Datacoral is this week’s Data Engineering Podcast sponsor.  Datacoral provides an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to construct its infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit for more information.  

What happens when your expanding log & event data threatens to topple your Elasticsearch strategy? Whether you’re running your own ELK Stack or leveraging an Elasticsearch-based service, unexpected costs and data retention limits quickly mount.  Now try CHAOSSEARCH.  Run your entire logging infrastructure on your AWS S3.  Never move your data. Fully managed service.  Half the cost of Elasticsearch. Check out this short video overview of CHAOSSEARCH today!  Forget Elasticsearch! Try  – search analytics on your AWS S3.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit today to find out more.
  • Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. Try it out for yourself at and don’t forget to thank them for supporting the show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Karen Heaton and Mark Sherwood-Edwards about the idea of data protection, why you might need it, and how to include the principles in your data pipelines.


  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what is encompassed by the idea of data protection?
    • What regulations control the enforcement of data protection requirements, and how can we determine whether we are subject to their rules?
  • What are some of the conflicts and constraints that act against our efforts to implement data protection?
  • How much of data protection is handled through technical implementation as compared to organizational policies and reporting requirements?
  • Can you give some examples of the types of information that are subject to data protection?
  • One of the challenges in data management generally is tracking the presence and usage of any given information. What are some strategies that you have found effective for auditing the usage of protected information?
  • A corollary to tracking and auditing of protected data in the GDPR is the need to allow for deletion of an individual’s information. How can we ensure effective deletion of these records when dealing with multiple storage systems?
  • What are some of the system components that are most helpful in implementing and maintaining technical and policy controls for data protection?
  • How do data protection regulations impact or restrict the technology choices that are viable for the data preparation layer?
  • Who in the organization is responsible for the proper compliance to GDPR and other data protection regimes?
  • Downstream from the storage and management platforms that we build as data engineers are data scientists and analysts who might request access to protected information. How do the regulations impact the types of analytics that they can use?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at


The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. This week's episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. And having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elasticsearch cluster because it's storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs.
They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elasticsearch cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don't forget to thank them for supporting the show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the Data Orchestration Summit and Data Council in New York City. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Karen Heaton and Mark Sherwood-Edwards about the idea of data protection, why you might need it, and how to include the principles in your data pipelines. So Karen, can you start by introducing yourself?
Karen Heaton
Yes. Good afternoon. My name is Karen Heaton. I run my own data protection consultancy. I've been working in data and systems implementations in financial services for over 20 years, and I'm a real fan of technology. My consultancy specializes in data protection compliance, mainly for small and medium-sized organizations, including tech startups.
Tobias Macey
And Mark, can you introduce yourself as well?
Mark Sherwood-Edwards
Sure. My name is Mark Sherwood-Edwards. I'm a tech lawyer by background. I now do a lot of work in the privacy space, and I've got a privacy business called This is DPA. Like Karen, I'm based in London, UK, and Karen and I do a lot of work together around the GDPR. My background, as I said, is as a lawyer, primarily in technology, outsourcing, intellectual property, those kinds of areas.
Tobias Macey
And going back to you Karen, do you remember How you first got involved in the area of data management?
Karen Heaton
Well, yes. Ever since I first started working, which is quite a long time ago, I've always been involved in systems and data, whether I've been working in operations for banks or other types of companies, or working for software companies, helping them implement their software solutions in their clients' organizations. So data and systems have been a very large part of my life. And now, with the data protection regulations coming in, these elements are marrying together into a very interesting topic area. So that's many years of experience in this area.
Tobias Macey
And Mark, do you remember how you got involved in data management?
Mark Sherwood-Edwards
Yeah, well, probably not quite as directly as Karen. I've been in-house counsel for a number of outsourcing companies over the years, doing a lot of HR and marketing outsourcing, all involving a fair amount of data. So I've always been quite interested in it, and in the kind of interrelationship between data and intellectual property, whether you own data or don't own data. So it's been an interest of mine that's been going on for quite a number of years now.
Karen Heaton
It's been interesting, because Mark and I often talk about how, given our ages, we've enjoyed watching technology grow from the days of very large mainframe computers and punch cards all the way up to what we have today, with a plethora of apps and now, you know, artificial intelligence and machine learning. So we both feel very privileged, actually, that we've been able to watch the growth of systems and data over the last 30-plus years.
Tobias Macey
And as you mentioned, in recent years there have been some regulations coming out that cover this idea of data protection, so I'm wondering if you can start by explaining a bit about what that term means and what it encompasses from the perspective of data storage systems and data management.
Mark Sherwood-Edwards
I'll kick off on that one. So when people talk about data protection, they're talking essentially about personal data: the protection of personal data and the regulation of personal data. Now, most people nowadays are aware of the GDPR, which came out in the European Union last year and which is extraterritorial, in that it has impact outside the European Union as well. We'll talk about that a bit later. But the GDPR wasn't particularly new, at least in Europe. Data protection has been around for 20 or 30 years, and it starts, really, with human rights. In fact, Article 8 of the European Convention on Human Rights says, "Everyone has the right to respect for his private and family life, his home and his correspondence." And that respect for your private life, with the day-to-day data you use and create being seen as part of your private life, is where it all starts from. Interestingly, I took the opportunity to look at some corresponding US documents, and the US Constitution, the one that starts off with "We the People", as you'll know, Tobias, goes on, "We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility", and that "insure domestic Tranquility" is possibly (I'm not a US constitutional lawyer, you'll appreciate) the same kind of concept of how one protects private life. So the general concept has been around for a long time, and that is: it's fine for companies to take and use other people's personal data, their customers' personal data, which they may collect because people open an account, or it's a bank, or you may be shopping on Amazon or whatever, provided they do that lawfully, transparently, and in a fair way. And essentially that means that if you're going to disclose your personal data to someone else or to a company, they've got to be very explicit about what they're going to do with it and not do things which they haven't disclosed at the time.
And if that's the case, broadly, then what they do is going to be lawful and nobody's going to be unhappy with it. Another way of looking at it is that it's a trust-based thing. We consumers are trusting companies with our personal data, and having trusted them with that data, we are expecting them to respect that trust by acting lawfully, transparently, and so on. So essentially, data protection is a codification of a trust-based principle. That's the high-level view, and then you can drill down to various lower-level views as we progress.
Tobias Macey
And in terms of the actual specific regulations, as you said, it's a concept that's been around for quite a while. But in recent years we've been more explicitly codifying it in the GDPR, and then recently in California with the CCPA. I'm wondering if you can just discuss a bit of the scope of those regulations, how an organization or an individual can best determine whether or not they're subject to those rules, and what particular information is encompassed by that.
Mark Sherwood-Edwards
Okay, well, both the GDPR and the CCPA are essentially rules about personal data and people's use of other people's personal data. So they sound very similar, and they have a lot of concepts in common, and some differences. Some of the concepts in common, for example, apply in both the GDPR and the CCPA. If you're going to use other people's personal data, your consumers' personal data, you've got to tell them what you're doing, via privacy notices: here's where we use your personal data, these are the categories of data we're going to use, and these are the purposes we're going to use them for. That applies across both the CCPA and the GDPR. Both equally have a right for the relevant consumer to get access to their data, i.e. a copy of the data that's been held about them, and both have the right for consumers to get their data deleted. Now, then the differences start to come. The GDPR is the product of longer thinking: not only was there a preceding regulation from 1998, it also had a longer gestation period. The CCPA was put together in a bit of a rush to meet a deadline. So some of the fundamental differences are that you get concepts in the GDPR like lawful basis. There's a finite list of six lawful bases in the GDPR for processing personal data. Personal data means more or less the same in both. The US tends to call it PII, personally identifiable information; Europe calls it personal data. It's a very broad category, including your name, email address, and phone number, but also online identifiers, anything which can be tracked back to you on a kind of one-to-one basis. So both have that in common. The GDPR has this notion of the lawful basis. You have to have a lawful basis to handle personal data, and you have to explain which lawful basis you're using. For example, have you got consent? If you don't have consent,
is it pursuant to a contract? If not that, is it pursuant to a legitimate interest? That kind of stuff. And the GDPR also has some more fundamental concepts kicking around in it, things like privacy by design and data minimization. Privacy by design means kind of what it says. It's the evolution of data protection from the early days, and now the general thinking is that if you have any system (by which I mean not just hardware and software, but the people around the system) that will be handling or processing personal data, you need to have thought out how you're going to build in privacy requirements from the start. It's no longer okay just to try to retrofit it afterwards. Data minimization means a couple of things. First, what's the minimum amount of data that you need to accomplish the job, not what's the maximum amount of data. And then, once you have that data, when you use it in your business, you use the minimum amount for each particular job, not sharing data around like confetti. So those are some structural, fundamental things which are explicitly called out in the GDPR and not explicitly called out in the CCPA. The CCPA approach is much more an opt-out approach. Most things are permitted. The primary thing is that it applies to California residents and most companies dealing with California residents. If it applies to you, you have the right to opt out of what they call the selling of data. So you can object: you can't sell my data. And effectively, selling is a broad term for any kind of sharing, licensing, and selling of data. So they're similar in some ways. They've got some fundamental underpinnings, tectonic plates, which are different, but they end up driving the same kind of behavior for businesses over time, which is: be more careful with personal data.
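The data minimization idea described above (use only the minimum data needed for each particular job) can be sketched in code. This is a minimal illustration under stated assumptions, not a compliance mechanism: the purposes, field names, and the `minimize` helper are all hypothetical and do not come from the episode.

```python
# Hypothetical mapping from a declared processing purpose to the
# fields that purpose actually needs. Names are illustrative only.
ALLOWED_FIELDS = {
    "shipping": {"name", "postal_address"},
    "newsletter": {"email"},
}


def minimize(record: dict, purpose: str) -> dict:
    """Return only the fields permitted for the given purpose."""
    if purpose not in ALLOWED_FIELDS:
        # Refuse to hand out data for a purpose nobody has declared.
        raise ValueError(f"no declared purpose: {purpose!r}")
    allowed = ALLOWED_FIELDS[purpose]
    return {key: value for key, value in record.items() if key in allowed}


customer = {
    "name": "A. Example",
    "email": "a@example.com",
    "postal_address": "1 Example Street",
    "phone": "555-0100",
}

# The newsletter job only ever sees the email address.
newsletter_view = minimize(customer, "newsletter")
```

Keeping the allowlist in one place means that adding a field to a purpose is an explicit, reviewable change rather than something that happens by default.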
Karen Heaton
I think, Mark, that Tobias also asked about the territorial reach of GDPR.
Tobias Macey
Yeah, some of the ways that you can determine whether or not the data that you're dealing with is actually subject to these regulations. And I think that the blanket approach that a lot of companies are taking is that it's too hard to identify at a granular level, whether or not somebody is a European citizen or isn't or is in some way related to the European Union or California. And so they just apply the same sets of principles in a blanket sense. And I'm wondering what your thoughts are on some of the sort of best strategies to approach the regulatory environment that we're in now.
Karen Heaton
So yeah, I mean, I'm sure most people have seen evidence of big American companies in particular undertaking compliance programs, signing up to the US Privacy Shield to show that they have adequate data protection standards, and even doing things like updating terms and conditions and privacy notices. Mailchimp, for example, introduced a double opt-in function into their platform so that consent is of a GDPR standard for any of their clients in the EU. So if you don't want to adopt a blanket approach, it comes down to truly being able to understand, in a granular way, what data you have in your data sets, how you acquired it, what you're going to use it for, and whether your organization actually does sell products and services into the EU, in which case it will definitely be subject to the regulations. The same goes if you have websites which people based in the EU can access, and those websites are running lots of cookies, plugins, pixels, etc. If those items are collecting data about individuals based in the EU, then they are also going to be subject to the rules of GDPR. So I understand the approach of just assuming everyone is subject to the regulations and then applying the blanket approach, but there are definitely ways of properly analyzing a business, asking more and better questions about the data sets that you have and how you have acquired them, and then taking a view based on that. But I agree it's difficult if you've got very large sets of data and perhaps not a lot of information around the background to those data sets.
Tobias Macey
And then, from the organizational and technical perspective, what are some of the conflicts or constraints that act against some of the efforts that they might try to put in place to implement data protection whether it's because The technical systems design that they have doesn't really allow for proper segregation or tracking or whether it's a matter of policy as far as helping the different people within the organization understand the importance of these different regulations and their enforcement.
Karen Heaton
Yeah, it's a really good question, I'd say, Tobias, because there are a number of reasons, I think, that we see conflicts and constraints. The biggest one, certainly Mark and I think, is that first and foremost you need management buy-in on the need to understand the regulations and implement appropriate standards in your organization in order to be compliant with them. This goes back also to the codification of trust that Mark talked about earlier on. It really is a trust journey, not only with your customers but also with your employees. I think lots of data scientists and data analysts want to do the right thing. But if organizations don't train them in what the right thing looks like, or don't give them the tools to do their job in a way that is compliant and meets the standards, it's hard for them to be able to do the right thing in the way that they would like to. So management buy-in is really important. And also, you know, it's an investment in your business. Sometimes, going through a compliance program, you can identify cost-saving measures in your business. If you do a proper data audit and data discovery, and you map out your records of processing activities, you might find activities going on in your business that are unnecessary, or you might have systems that don't need to be used or paid for. So there can be benefits gained from these compliance programs aside from the obvious one of being compliant. And also, sales and marketing departments are often in conflict with operational departments or compliance departments because they've got different goals. Obviously, the sales and marketing teams want to generate leads and get money in the door, but it has to be in a lawful and compliant way.
Mark Sherwood-Edwards
I think one of the areas where you see differences is this: most IT departments are concerned with the security of the data, you know, all the usual things on security, which I think of as protecting the perimeter. Then you can think about the controls within the perimeter. So, for example, if too many people have access to personal data, because people aren't monitoring who has access or aren't applying least privilege within the organization, then that starts giving you a data protection compliance problem. In fact, there's a good example from Portugal the other day, which is obviously in the European Union. There was a Portuguese hospital which had something like 50 doctors, and there were 200 people in the organization who had doctor-level access to patient records. So that gives you a good example. That's not a hack. You know, it's not like the server hasn't been correctly hardened. I mean, everything's working like it should do; everything's been patched correctly. It's just a laxity at the business level. No one's really thinking about what this trust actually means: we've been trusted to hold on to this data and protect it properly. But actually, that would mean revising, or checking each time, who has access to what data.
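The hospital example suggests a simple recurring check: compare who holds a privileged access level against who actually needs it. The following is a rough sketch under stated assumptions; the `User` type, the `JUSTIFIED_TITLES` policy table, and the sample accounts are all hypothetical and are not taken from the case discussed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    username: str
    job_title: str
    access_level: str


# Hypothetical policy: which job titles justify which access level.
JUSTIFIED_TITLES = {"doctor": {"doctor"}}


def find_excess_access(users, access_level):
    """List users holding an access level their job title does not justify."""
    justified = JUSTIFIED_TITLES.get(access_level, set())
    return [
        user
        for user in users
        if user.access_level == access_level and user.job_title not in justified
    ]


users = [
    User("asilva", "doctor", "doctor"),
    User("bcosta", "receptionist", "doctor"),
    User("cdias", "technician", "basic"),
]

# Accounts with doctor-level access that are not held by doctors.
flagged = find_excess_access(users, "doctor")
```

In practice the user list would come from an identity or HR system, and the interesting design question is how often the check runs and who reviews the flagged accounts.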
Tobias Macey
And, you know, we were talking about conflicts and constraints around the efforts to implement data protection in organizations. But it's also important for organizations to realize that there's supplier due diligence. I'm sure many of your listeners will have gone through an RFP process or have had to fill in a due diligence questionnaire for a large client. GDPR and data protection compliance is often a big part of that due diligence process, and you may not actually get the business unless you have certain standards and procedures in place. Another thing too is that the initial era of big data was "just capture everything, because you never know when you might need it." And a lot of the current trends are pushing in the opposite direction: don't collect it unless you know that you need it. So it's interesting to see how that has been manifesting in the industry in terms of the technology choices that people make, as well as the conversations that people are having as far as how to approach analytics, data collection, and data management.
Karen Heaton
I think it's great that the conversations are being had and the trends are changing. I think that shows a big awareness now of data protection in and of itself.
Tobias Macey
And then one of the big challenges in these technical implementations and in the systems designs is understanding what data you have, where you have it, and who's accessing it. So I'm wondering if you can talk through some of the challenges that you've seen people go through and some of the solutions that they've come up with to approach that idea of just understanding what data exists and how to properly maintain, secure, and protect it.
Karen Heaton
That is a great point, actually, because understanding what data you've got in which systems, where it came from, who has access to it, and where it's stored is a foundational step on your data protection compliance journey. Until you've done that exercise, it's actually quite difficult to do things like build out accurate privacy notices or put decent processes and procedures in place, even including things like data handling. My experience with my clients is that it's one of the hardest exercises for them to do if they haven't previously done it. If you're a smaller organization, you can take a super simple approach like capturing it in an Excel spreadsheet, but even that is a very time-consuming exercise. Luckily, there are a number of different tools on the market that can be used. There are data mapping tools, things that can map out your data flows, where you can enter which systems your data is stored in, who the processors are, and where they're located. So there are tools out there that allow you to capture all that data audit or data inventory information. And that is something that I would really recommend an organization invest in, because once you've done it, it's just a case of maintaining it. So that's the first thing I would suggest on that one.
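Whether it lives in a spreadsheet or a dedicated tool, a data inventory comes down to recording the same few facts for every data set. Here is a minimal sketch of such a record, assuming hypothetical field names based on the questions raised above (which system, where the data came from, where it is stored, who the processors are, who has access); it is not the schema of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class InventoryEntry:
    """One row of a hypothetical data inventory."""
    dataset: str
    system: str
    source: str = ""            # where the data came from
    storage_location: str = ""  # where it is stored
    processors: list = field(default_factory=list)         # third parties handling it
    roles_with_access: list = field(default_factory=list)  # who can see it


def missing_details(entry: InventoryEntry) -> list:
    """Name the inventory questions that are still unanswered for an entry."""
    gaps = []
    if not entry.source:
        gaps.append("source")
    if not entry.storage_location:
        gaps.append("storage_location")
    if not entry.processors:
        gaps.append("processors")
    if not entry.roles_with_access:
        gaps.append("roles_with_access")
    return gaps


entry = InventoryEntry(
    dataset="newsletter_subscribers",
    system="email_platform",
    source="website signup form",
    roles_with_access=["marketing"],
)

# Highlights what still has to be found out before the entry is complete.
gaps = missing_details(entry)
```

A check like `missing_details` turns the inventory into a to-do list, which helps with the maintenance step once the initial audit is done.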
Mark Sherwood-Edwards
Interesting. I've also done it in some low-tech ways. I had a company that asked me, with the GDPR, to help them get from where they were, which wasn't very GDPR-friendly, to a good data protection regime. I bought a large roll of wallpaper lining paper, which, you know, is very cheap and there's a lot of it, and sliced it up and stuck it up on the wall, so we had a kind of war room. We knew roughly what kinds of data were coming in, and you could do a kind of, you know, data in, let's say where data comes in, where data is held, and data out: the lifecycle of the data. You can work through all that, and then you can scribble bits on Post-it notes and stick them up. It's very analog, very old school. But you could get people involved. You know, it can't just be the IT guys, it can't just be the data guys or the compliance people. It has to be everybody in there talking about it, sticking their bits of information up. So that worked quite well. One of the interesting outcomes is that we went into that company thinking they held records covering 2 million people. By the time we'd finished, we realized they held records covering 13 million people, which is a bit of a discrepancy. But, you know, no one had really thought about it, right? Everyone was just doing their normal job, working quite hard, but in their silos. And one of the things that you come to realize is that this is not a silo. You have to kind of remove the silos to get it to work.
Karen Heaton
To get it to work well, it needs involvement across every department within the organization. It's not just an IT project, which is perhaps how data protection was seen before. I think most people would have linked data protection to data security. Well, it's more than that; data privacy is a big side of it as well as data protection. But on the topic of tools that are available, your listeners might be interested in the International Association of Privacy Professionals. Every year they produce a privacy technology report that lists all the vendors in the privacy tech space, and there's some really interesting information in that report. I'll send you a link to that afterwards, Tobias.
Tobias Macey
Yeah, I'll definitely be interested to take a look at that and see what types of systems they've got in their purview. Another thing that plays into the idea of data protection and identifying and auditing the data flows is another big challenge in the data management space: data provenance. That covers everything from what you were describing, data in through data out, to the point of doing analytics or machine learning, figuring out what records are actually being used and what attributes of those records are being used within those machine learning workflows, so that you can make sure you're not either inadvertently exposing information or inadvertently including information that's not actually necessary for the conclusions you're trying to derive.
Karen Heaton
Exactly. Data provenance is hugely important. As one example, say you're part of a project team, you're given a large data set, and you have to go and do the data preparation on it. Who is going to be asking the questions around why we've got this data and what we're allowed to do with it? At what point in your lifecycle of obtaining, preparing, storing, securing, and then using that data are those important questions going to be asked?
Mark Sherwood-Edwards
Interesting. One of the things I've seen done, which I think works well for companies taking big datasets in and out, is applying something a bit like customs. At the airport you've got to go through customs, they check you, they check your passport, and then you go out. One of the things I've done in some of these engagements is put in a similar kind of thing for data. You can't import any data into the system without some kind of analysis of what the data is, where it's coming from, what rights attach to it, and so on. Then there is some mechanism within the company which allows that data to come in; it doesn't come in automatically. And it's the same thing for when you send data out. In fact, there was one big movement of data we were discussing, and this is a company processing fairly sensitive data, such as financial data, on a number of people. You might send it out to a third party to take a look at, and it would go as an attachment in an email. Of course, you never quite know where the attachment is going to end up. So we started to change that so that you never send the attachment. Now you send a link to a data room, and the person you sent it to comes and looks at the data there. It's not perfect, if you see what I mean, but at least there's an audit trail of how the data is being accessed.
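The "customs" idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Declaration` fields, the `clear_customs` function, and the in-memory `audit_log` are hypothetical names, and a real control would sit in front of the actual ingestion and export paths.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Declaration:
    """What must be stated before a dataset crosses the boundary (illustrative fields)."""
    dataset: str
    direction: str     # "in" or "out"
    description: str   # what the data is
    origin: str        # where it is coming from
    rights: str        # what rights attach to it
    approved_by: str   # who signed it off; empty means nobody has yet


REQUIRED = ("description", "origin", "rights", "approved_by")
audit_log = []  # every attempt is recorded, whether admitted or not


def clear_customs(declaration):
    """Admit the transfer only if every required answer is present; always log the attempt."""
    missing = [name for name in REQUIRED if not getattr(declaration, name).strip()]
    admitted = not missing
    audit_log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "dataset": declaration.dataset,
        "direction": declaration.direction,
        "admitted": admitted,
        "missing": missing,
    })
    return admitted


# Hypothetical example: a purchased list with no sign-off is held at the border.
unreviewed = Declaration("purchased_leads", "in", "contact details",
                         "third-party broker", "consent for email marketing", "")
print(clear_customs(unreviewed), audit_log[-1]["missing"])
```

The point of the design is that nothing comes in or goes out automatically, and that refused attempts leave a trace just as admitted ones do.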
Tobias Macey
And then another question that came up when you were discussing bringing everybody into the same room to map out all the different ways that the data is being used, and discovering that they actually had several multiples more records to be concerned about than they had originally thought, is the idea of who's actually responsible for making sure that an organization is considering all of the different implications, ensuring that the company is appropriately in compliance with the different regulations, and even identifying what regulations they're subject to.
Karen Heaton
That's an interesting one to talk about, because obviously, ultimately, the board is responsible. In a large organization, responsibilities would need to be shared and appropriately defined. I've seen it where you might start from a basis where you've got someone who's a system owner for each of the systems that you've got, and they would then be responsible for the data within those systems. Then, if an organization is large enough, they might have a privacy team, for example, and the system or data owners would then have dialogues with the privacy team. So it is really important to get back to the foundational step, which is your data audit, your data inventory, understanding what you've got, in order to be able to appropriately assign the responsibility for certain aspects of data protection within the organization. Larger companies could also have Chief Data Officers; I've seen that used quite a lot in some of the big banks. And even tech startups, just because they're a startup and they're small, might still have hundreds of millions of records, so they still need somebody in their organization who's responsible for data protection. If they don't have the skills, they should go externally and source those skills.
Mark Sherwood-Edwards
Yeah, you should have some kind of data operating model, right, so you know how it works. And you should have some kind of monthly governance: every month the different disciplines get together, and you have a standing agenda to work through. Things like vendors and suppliers, right? If you outsource your processing to someone else, you're still responsible for that data, even though somebody else is processing it for you. So who's checking up on the vendors to make sure that they're doing what they should be doing, that they've got good security and all that kind of stuff? So although one person will probably be responsible in the hierarchy in the end, as an operational execution it tends to be a distributed responsibility, with different people bringing their different angles in. And if they don't have a regular slot in which to meet and discuss issues, it's more likely that issues will get missed.
Karen Heaton
And it all starts from training and awareness. I think I mentioned earlier that you've got to give your employees the chance to be able to do the right thing, and it's only fair to them that they get the training to understand what it is they need to be doing.
Tobias Macey
And then, as a corollary to the idea that we were discussing earlier of tracking and auditing the information that we're storing and using, in the GDPR at least there is the right to be forgotten clause, where a company needs to be able to thoroughly delete information pertaining to a given individual. That can be quite complicated, especially when we're dealing with complex systems with multiple different storage layers or multiple different pipelines that are replicating bits and pieces of information throughout. So I'm curious what you have seen as far as challenges at the technical and organizational level, and some of the strategies and technologies that organizations have found useful for being able to follow that regulation.
Mark Sherwood-Edwards
Yeah, good question. Interestingly, in the GDPR you've got, I think, six or seven data subject rights (consumers are known as data subjects): the right to erasure or the right to be forgotten, the right to get access to your data, the right of correction, the right to this, the right to that, but it's either six or seven. Although the right to be forgotten grabs most of the headlines, the one that's mainly exercised is the data subject access right, the right to get a copy of the data held about you. That exists in the GDPR and the CCPA, and you get the same issue, right, multiple systems and so on. Talking about the subject access right to begin with, it's not that you have to go to the ends of the earth to produce everything, but you've got to make a reasonable effort. Clearly, if you've got a coherent system with everything in place, it's easy to do; if you're juggling six or seven legacy systems, it gets much more complicated. That's a good reason to get rid of data you don't need, to be frank, because then you don't have to report on it. In terms of the right to be forgotten, the right of deletion, it's not an unfettered right; it's not unqualified. If I've got a contract with you as a customer, you can't just say, "delete my data." Well, I can't do that while I've got a contract with you. And even when the contract's over, there are reasons you can refuse. For example, you're allowed to keep the data if you think that you might need it for a legal claim or a legal defense at some point down the line. So it's not as unlimited as you might think.
Now, the UK has always been a bit more business friendly about this kind of stuff than other bits of Europe, and it is still in Europe for the time being. So, for instance, it's always been accepted that you could hash some of the information and that would basically count as deletion. The UK used to have this thing called putting out of use, where for some reason you might not be able to delete all the data, but you could park it somewhere where it was not accessible, or not easily accessible, to the business, say by requiring two or three sign-offs so that it's difficult to make it accessible. And you can take those kinds of protections. So despite its exciting name, the right to be forgotten, for most businesses, as a practical matter, doesn't come up that often, is not an unqualified right, and causes fewer issues than you might expect.
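To make the "not an unqualified right" point concrete, here is a small Python sketch of how a deletion request might be triaged and then applied across several systems. It is an illustration only: the two refusal grounds are the examples given in the conversation, the names (`SubjectStatus`, `erasure_decision`, `erase_everywhere`) and the dictionary-backed "systems" are assumptions, and whether a refusal is justified in a real case is a legal judgment rather than something this code decides.

```python
from dataclasses import dataclass


@dataclass
class SubjectStatus:
    """Facts about a data subject that bear on a deletion request (illustrative)."""
    subject_id: str
    has_active_contract: bool
    needed_for_legal_claims: bool


def erasure_decision(status):
    """Return "erase", or a short reason for refusing, using the two grounds discussed."""
    if status.has_active_contract:
        return "refuse: active contract"
    if status.needed_for_legal_claims:
        return "refuse: retained for possible legal claims"
    return "erase"


def erase_everywhere(subject_id, systems):
    """Delete the subject from every system; return the sorted names of systems that held them.

    `systems` maps a system name to a dict of records keyed by subject id.
    """
    removed = []
    for name, records in systems.items():
        if records.pop(subject_id, None) is not None:
            removed.append(name)
    return sorted(removed)


# Hypothetical stores standing in for a CRM, a billing system, and a mailing list.
systems = {
    "crm": {"u1": {"email": "a@example.com"}, "u2": {"email": "b@example.com"}},
    "billing": {"u1": {"plan": "basic"}},
    "mailing_list": {"u2": {"subscribed": True}},
}

request = SubjectStatus("u2", has_active_contract=False, needed_for_legal_claims=False)
if erasure_decision(request) == "erase":
    print(erase_everywhere(request.subject_id, systems))
```

The list of systems returned by `erase_everywhere` doubles as a record of where the individual's data was actually found, which is only possible if the inventory of systems is complete.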
Karen Heaton
Just to follow on from what Mark was saying there about the challenges, whether they're deletion requests or subject access requests, in a complex ecosystem where you have a number of potentially interconnected systems: the first step in overcoming some of those challenges is, again, this foundational step we've already talked about, which is understanding what data you've got, where, why you've got it, etc. Once you've done that, there are a number of tools on the market. Coming back to the privacy tech sector, there are discovery tools that can assist with even working out what data you've got where. I've also seen some tools that can bring up a single view of an individual and allow you to perform deletion requests in a much quicker and more automated way. So yes, there are solutions on the market to assist with some of the complex and time-consuming requests that you may get. But it's always important to do two things: one is to understand whether the request is one that is valid under the regulations, and the other is to have done your homework on your systems and your processes, to understand what data is where and how much of a task this actually is that you have to undertake.
Tobias Macey
And then another layer where this manifests, particularly in terms of updating data or having a customer elide bits of information from their records, is how it's being used in downstream use cases, whether it's business analytics or doing some sort of machine learning on aggregate data. I'm thinking of how that plays into the need to regenerate a model after it's gone through a training regimen once the data gets updated, how the data is actually being used, what particular attributes of a record are being used within those analytics, and some of the technologies and techniques that are viable for still remaining in compliance with these regulations, especially as far as some of the explainability requirements that come up.
Karen Heaton
So yeah, these are the million-dollar questions, really. I think we're getting into the hard part now, definitely. Once we start to go downstream from the data collection, and the data scientists and analysts are starting to run searches across the data, it still comes back to the organization's responsibility and requirement to provide the scientists and analysts with a decent, clean set of properly obtained, lawful data. If data scientists or analysts then want to access data within that data set that could be protected, there should be appropriate controls, tags, logging, or audits that the scientists and analysts can see and be aware of when they come to do the projects they've been assigned. It's also possible, as a way of embedding the management of that at the beginning of a project, for the scientists or analysts, or even the project managers running a project that involves scientists and analysts, to do a privacy impact assessment on what the project is to achieve. If the intended outcomes from the project are those that might result in the creation of a profiled data set upon which decisions will be made, then really, at the beginning of the project, they should do some sort of impact assessment: what data are they going to use, are they using it in a lawful way, and what safeguards are they putting in place for the results that are generated from that particular project?
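One way to picture the "controls, tags, logging, or audits" mentioned above is a column-level tag map with an access log in front of the data set. The Python sketch below is an assumption-laden illustration: the tag names, the `select_columns` function, the `allow_personal` flag, and the example rows are all hypothetical, and untagged columns are deliberately treated as personal so that nothing slips through by omission.

```python
# Hypothetical column tags; anything not listed is treated as personal by default.
COLUMN_TAGS = {
    "email": "personal",
    "postcode": "personal",
    "basket_value": "non_personal",
    "signup_month": "non_personal",
}

access_log = []  # every request is recorded, including refused ones


def select_columns(rows, columns, analyst, allow_personal=False):
    """Return only the requested columns, refusing personal ones unless approved."""
    blocked = [
        column for column in columns
        if COLUMN_TAGS.get(column, "personal") == "personal" and not allow_personal
    ]
    access_log.append({"analyst": analyst, "columns": list(columns), "blocked": blocked})
    if blocked:
        raise PermissionError(f"columns tagged personal need approval: {blocked}")
    return [{column: row[column] for column in columns} for row in rows]


rows = [
    {"email": "a@example.com", "postcode": "AB1 2CD", "basket_value": 42.0, "signup_month": "2019-10"},
    {"email": "b@example.com", "postcode": "EF3 4GH", "basket_value": 17.5, "signup_month": "2019-11"},
]

print(select_columns(rows, ["basket_value", "signup_month"], analyst="analyst_a"))
```

In this sketch the approval behind `allow_personal=True` would be the outcome of the kind of impact assessment described above, not a flag an analyst sets for themselves.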
Mark Sherwood-Edwards
Looked at another way: if you go back to the beginning, to the point we were talking about being about trust, okay, you've got your data set. So the question is, what consents did you get and what did you disclose, right? And what are the reasonable expectations of the consumers, and are you acting within them? If you want to go beyond what was disclosed, under the GDPR you can do that provided it's akin, in the same ballpark, but you typically need to inform the data subjects that that's what you're going to do. So that's an initial constraint. Now, of course, GDPR applies to personal data. Anonymization is a sliding scale, as we all know, but let's say you sufficiently anonymize the data: then it stops being personal data, you're not regulated by the GDPR anymore, and you can do what you want with it, provided no one can come back and reverse it back to the original people. If you can't do that, then you have to act within the initial parameters. And that comes back to thinking about what you need to begin with, thinking it through and planning it through in terms of the issues that arise. The GDPR has a thing about automated decision making and profiling. It's not forbidden: you're allowed to have profiling, you're allowed to have automated decision making, and you're allowed to have automated processing, provided you've made the correct disclosures. The issue is where you have automated decisions that are taken without any human involvement. If you're doing that, you have to disclose it, and the person about whom the decision is taken has the right to object and the right to have an explanation as to how the decision was taken. And that brings us into the things you were talking about: interpretability.
So if you've got some machine learning which is pumping out decisions and you can't explain how the decision was reached, that starts to give you a bit of a problem, both in a GDPR sense and in a more practical sense as well. Is there some bias against a particular kind of people? That's happened in the past. One of the concerns is that machine learning will have inbuilt biases, which I just find odd, because we know that humans definitely have them. So those are the new, cutting-edge issues that people are wrestling with, and there's a lot of thinking and discussion and guidance about it. Part of what I'm hoping to see is AI which validates AI. So you have a test data set and you run it through, and if it produces the right answer each time, you know it's working, or something like that. Does that help?
Tobias Macey
Yeah, that's definitely useful. And some of the topics of bias, too, come back to the data engineering layer and data collection, as far as trying to identify potential biases that exist in the data sets and then either seeking additional or alternative sources of information to complement them, or at least annotating the data set to say this is a potential source of bias, so that the people who are performing the analysis are aware of that and can try to counteract it in terms of the algorithms that they apply to the data, to make sure that they try and strip out some of the bias. But as you said, we're all human. All the computer algorithms that we use are written by humans, and so there's no way to completely divorce ourselves from bias, but we can at least try to identify and account for it.
Mark Sherwood-Edwards
Great. And there are some examples where it's actually quite easy to use a machine to remove bias. If you've got CVs coming in, you can take out the references to sex, and you could take out all the age references. You might take out all the unusual, foreign-sounding names and replace them all with Smith, you know, if that's the kind of background you're from. So there are ways in which machines will actually help counter human bias as well. In fact, I will send you the link: there's some very interesting work done by the UK ICO on this kind of stuff. I think it's called an AI auditing framework, and they've got some kind of think-tank work on what the strategies are and what steps you should go through to make your development of machine learning successful, in all the senses of successful. I saw a job ad the other day about who needs to be involved in machine learning, and one of the roles was a job I'd never heard of before, an ethicist, which is like a practical Thomas Aquinas. A bit like the man who is married to Madam Secretary, if you've ever watched that: a professor of ethics, but with a practical application.
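The CV example above translates into a very small amount of code when the CV is already structured data. The Python sketch below assumes a CV arrives as a dictionary; the field names in `REMOVED_FIELDS`, the `redact_cv` function, and the sample record are illustrative assumptions. It does nothing about identifying details buried in free text, which would need separate handling.

```python
REMOVED_FIELDS = {"sex", "age", "date_of_birth"}  # assumed field names


def redact_cv(cv, placeholder_name="Smith"):
    """Return a copy of the CV without the removed fields and with the name replaced."""
    redacted = {key: value for key, value in cv.items() if key not in REMOVED_FIELDS}
    if "name" in redacted:
        redacted["name"] = placeholder_name
    return redacted


# Hypothetical incoming CV.
cv = {
    "name": "Example Applicant",
    "sex": "F",
    "age": 41,
    "skills": ["SQL", "Python"],
    "years_experience": 12,
}

print(redact_cv(cv))
```

The function returns a new dictionary rather than editing in place, so the original record stays available to whoever is entitled to see it later in the hiring process.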
Tobias Macey
Are there any other aspects of data protection and the regulatory frameworks that we should be considering, or ways to keep up to date with the regulations that are present and new ones that might come out, that we didn't discuss yet and that you think we should cover before we close out the show?
Mark Sherwood-Edwards
Well, it's definitely a fast-moving world. I mean, data protection has been around for a while, but the data is moving faster and the issues are moving quicker. You can listen to podcasts on data protection, if that's your thing. You can follow LinkedIn feeds and such, and attend conferences. I would say artificial intelligence is going to be a big one going forward. If you're interested in programmatic advertising, that's another big one; there's a lot of movement on that coming up soon. I think those are probably the main ones I would personally call out at the moment.
Karen Heaton
What I think would be useful would be the opportunity for data protection to be talked about, perhaps more regularly, at some of the engineering or technology conferences that happen. I remember I listened to one of your other podcasts where they were talking about Data Council, which is a conference for engineers and developers, quite cutting edge. I had a look at the conference online and I didn't see any topic that covered data protection. So, back to my point about giving employees, engineers, analysts, and scientists the knowledge to allow them to do the right thing: if data protection could be brought into more of the syllabus, perhaps, for those technical conferences, I think that would be really helpful. Bringing privacy in and helping them understand privacy by design, getting it right in the beginning, would be really helpful. It's hard to keep up to date with everything.
Tobias Macey
Well, for anybody who wants to follow along with the work that you are doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Mark Sherwood-Edwards
I have to say, in my view it's not a tooling issue; it's a culture and leadership issue. Once the culture and the leadership of the organization are aligned, then the right things start to happen in the natural flow. Karen might have a more tool-based answer.
Karen Heaton
I do have a tooling gap, actually. Subject access requests, which Mark discussed and we talked about earlier, are very time-consuming requests for almost every organization. Larger companies can invest in some of the sophisticated tools out there in the market to help them automate subject access requests, though a lot of human decision often still needs to go into them. What I'm not seeing at this point is a subject access automation tool that's accessible to organizations that aren't big organizations. So something in a lower price bracket, say, than a big enterprise-wide privacy management system. That would be the gap from my perspective. But certainly the privacy tech market is a really interesting market. There are lots of solutions and tools out there. There are over 275 vendors in the market now, and there's $500 million a year being invested, and growing, in tech startups in the privacy space. So there's a lot of technology out there in the market to help organizations.
Tobias Macey
Well, thank you both for taking the time today to join me and share your expertise and understanding of the data protection space. It's definitely something that, as you said, needs to be discussed more broadly and more widely understood. So thank you for all of your efforts on that front, and I hope you enjoy the rest of your day.
Karen Heaton
Thank you very much. Pleasure.
Mark Sherwood-Edwards
Thank you.
Tobias Macey
Thank you for listening! Don't forget to check out our other show at python to learn about the Python language, its community, and the innovative ways it is being used, and visit the site at data engineering to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it! Email hosts at data engineering with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.