Discussion

Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53

Summary

As data science becomes more widespread and has a bigger impact on people's lives, it is important that these projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the effort required to prevent negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data created Deon to improve the conversation around ethics within and between data teams. It is a Python project that generates a checklist of common concerns for data-oriented projects, organized by the stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.
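For those who want to follow along, getting started is a quick install away. The snippet below is a minimal sketch based on the project's documented command line interface; check the Deon docs for the exact flags supported by your version.

    # Install the tool from PyPI
    pip install deon

    # Write the default ethics checklist to a markdown file
    deon -o ETHICS.md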

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • This is your host Tobias Macey and this week I am sharing an episode from my other show, Podcast.__init__, about a project from Driven Data called Deon. It is a simple tool that generates a checklist of ethical considerations for the various stages of the lifecycle of data-oriented projects. This is an important topic for all of the teams involved in the management and creation of projects that leverage data. So give it a listen, and if you like what you hear, be sure to check out the other episodes at pythonpodcast.com

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by describing what Deon is and your motivation for creating it?
  • Why a checklist, specifically? What’s the advantage of this over an oath, for example?
  • What is unique to data science in terms of the ethical concerns, as compared to traditional software engineering?
  • What is the typical workflow for a team that is using Deon in their projects?
  • Deon ships with a default checklist but allows for customization (see the example sketch after this list). What are some common additions that you have seen?
    • Have you received pushback on any of the default items?
  • How does Deon simplify communication around ethics across team boundaries?
  • What are some of the most often overlooked items?
  • What are some of the most difficult ethical concerns to comply with for a typical data science project?
  • How has Deon helped you at Driven Data?
  • What are the customer facing impacts of embedding a discussion of ethics in the product development process?
  • Some of the items on the default checklist coincide with regulatory requirements. Are there any cases where regulation is in conflict with an ethical concern that you would like to see practiced?
  • What are your hopes for the future of the Deon project?
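As referenced above, Deon's default checklist can be swapped out for one of your own. The sketch below illustrates the general shape of a custom checklist definition; the key names mirror the structure of the project's default checklist file, but treat the file name and exact schema here as assumptions to verify against the Deon documentation.

    # my_checklist.yml -- a hypothetical, trimmed-down custom checklist
    title: Acme Data Ethics Checklist
    sections:
      - title: Data Collection
        section_id: A
        lines:
          - line_id: A.1
            line_summary: Informed consent
            line: If there are human subjects, have they given informed consent?

    # Generate output from the custom definition (flag name per the project docs):
    # deon --checklist my_checklist.yml -o ETHICS.md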

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

Data Engineering Weekly with Joe Crobak - Episode 27

Summary

The rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident, as so many of us do. After becoming engrossed in researching the details of distributed systems and big data management for his work, he began sharing his findings with friends. This led to his creation of the Hadoop Weekly newsletter, which he recently rebranded as the Data Engineering Weekly newsletter. In this episode he discusses his experiences working as a data engineer in industry and at the USDS, his motivations and methods for creating a newsletter, and the insights that he has gleaned from it.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Joe Crobak about his work maintaining the Data Engineering Weekly newsletter, and the challenges of keeping up with the data engineering industry.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What are some of the projects that you have been involved in that were most personally fulfilling?
    • As an engineer at the USDS working on the Healthcare.gov and Medicare systems, what were some of the approaches that you used to manage sensitive data?
    • Healthcare.gov has a storied history. How did the systems for processing and managing the data get architected to handle the amount of load that they were subjected to?
  • What was your motivation for starting a newsletter about the Hadoop space?
    • Can you speak to your reasoning for the recent rebranding of the newsletter?
  • How much of the content that you surface in your newsletter is found during your day-to-day work, versus explicitly searching for it?
  • After over 5 years of following the trends in data analytics and data infrastructure, what are some of the most interesting or surprising developments?
    • What have you found to be the fundamental skills or areas of experience that have maintained relevance as new technologies in data engineering have emerged?
  • What is your workflow for finding and curating the content that goes into your newsletter?
  • What is your personal algorithm for filtering which articles, tools, or commentary gets added to the final newsletter?
  • How has your experience managing the newsletter influenced your areas of focus in your work, and vice versa?
  • What are your plans going forward?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

The Future Data Economy with Roger Chen - Episode 21

Summary

Data is an increasingly sought-after raw material for business in the modern economy. One of the factors driving this trend is the growth of machine learning and AI applications, which require large quantities of information to work from. As the demand for data becomes more widespread, the market for providing it will begin to transform the ways that information is collected and shared within and between organizations. With his experience as a chair for the O'Reilly AI conference and an investor in data-driven businesses, Roger Chen is well versed in the challenges facing us and the solutions being developed. In this episode he shares his perspective on the ways that businesses can work together to create shared data resources, allowing them to reduce the redundancy of their foundational data and improve their overall effectiveness in collecting useful training sets for their particular products.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers
  • A few announcements:
    • The O’Reilly AI Conference is coming up. Happening April 29th and 30th in New York, it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
    • If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
  • Your host is Tobias Macey and today I’m interviewing Roger Chen about data liquidity and its impact on our future economies

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • You wrote an essay discussing how the increasing usage of machine learning and artificial intelligence applications will result in a demand for data that necessitates what you refer to as ‘Data Liquidity’. Can you explain what you mean by that term?
  • What are some examples of the types of data that you envision as being foundational to multiple organizations and problem domains?
  • Can you provide some examples of the structures that could be created to facilitate data sharing across organizational boundaries?
  • Many companies view their data as a strategic asset and are therefore loath to provide access to other individuals or organizations. What encouragement can you provide that would convince them to externalize any of that information?
  • What kinds of storage and transmission infrastructure and tooling are necessary to allow for wider distribution of, and collaboration on, data assets?
  • What do you view as being the privacy implications from creating and sharing these larger pools of data inventory?
  • What do you view as some of the technical challenges associated with identifying and separating shared data from the data that is specific to the business model of the organization?
  • With broader access to large data sets, how do you anticipate that impacting the types of businesses or products that are possible for smaller organizations?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Defining Data Engineering with Maxime Beauchemin - Episode 3

Summary

What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more.

Transcript provided by CastSource

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers
  • Your host is Tobias Macey and today I’m interviewing Maxime Beauchemin

Questions

  • Introduction
  • How did you get involved in the field of data engineering?
  • How do you define data engineering and how has that changed in recent years?
  • Do you think that the DevOps movement over the past few years has had any impact on the discipline of data engineering? If so, what kinds of cross-over have you seen?
  • For someone who wants to get started in the field of data engineering what are some of the necessary skills?
  • What do you see as the biggest challenges facing data engineers currently?
  • At what scale does it become necessary to differentiate between someone who does data engineering and someone who manages data infrastructure, and what are the differences in terms of skill set and problem domain?
  • How much analytical knowledge is necessary for a typical data engineer?
  • What are some of the most important considerations when establishing new data sources to ensure that the resulting information is of sufficient quality?
  • You have commented on the fact that data engineering borrows a number of elements from software engineering. Where does the concept of unit testing fit in data management and what are some of the most effective patterns for implementing that practice?
  • How has the work done by data engineers and managers of data infrastructure bled back into mainstream software and systems engineering in terms of tools and best practices?
  • How do you see the role of data engineers evolving in the next few years?

Keep In Touch

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA