Data Cleaning

Cleaning And Curating Open Data For Archaeology - Episode 68

Summary

Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data that they host, the challenges that they face with scaling ETL processes which require domain specific knowledge, and how the information contained in connections that they expose is being used for interesting projects.

Introduction

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Eric Kansa about Open Context, a platform for publishing, managing, and sharing research data

Interview

  • Introduction

  • How did you get involved in the area of data management?

    I did some database and GIS work for my dissertation in archaeology, back in the late 1990’s. I got frustrated at the lack of comparative data, and I got frustrated at all the work I put into creating data that nobody would likely use. So I decided to focus my energies in research data management.

  • Can you start by describing what Open Context is and how it started?

    Open Context is an open access data publishing service for archaeology. It started because we need better ways of dissminating structured data and digital media than is possible with conventional articles, books and reports.

  • What are your protocols for determining which data sets you will work with?

    Datasets need to come from research projects that meet the normal standards of professional conduct (laws, ethics, professional norms) articulated by archaeology’s professional societies.

  • What are some of the challenges unique to research data?

    • What are some of the unique requirements for processing, publishing, and archiving research data?

      You have to work on a shoe-string budget, essentially providing "public goods". Archaeologists typically don’t have much discretionary money available, and publishing and archiving data are not yet very common practices.

      Another issues is that it will take a long time to publish enough data to power many "meta-analyses" that draw upon many datasets. The issue is that lots of archaeological data describes very particular places and times. Because datasets can be so particularistic, finding data relevant to your interests can be hard. So, we face a monumental task in supplying enough data to satisfy many, many paricularistic interests.

  • How much education is necessary around your content licensing for researchers who are interested in publishing their data with you?

    We require use of Creative Commons licenses, and greatly encourage the CC-BY license or CC-Zero (public domain) to try to keep things simple and easy to understand.

  • Can you describe the system architecture that you use for Open Context?

    Open Context is a Django Python application, with a Postgres database and an Apache Solr index. It’s running on Google cloud services on a Debian linux.

  • What is the process for cleaning and formatting the data that you host?

    • How much domain expertise is necessary to ensure proper conversion of the source data?

      That’s one of the bottle necks. We have to do an ETL (extract transform load) on each dataset researchers submit for publication. Each dataset may need lots of cleaning and back and forth conversations with data creators.

    • Can you discuss the challenges that you face in maintaining a consistent ontology?

    • What pieces of metadata do you track for a given data set?

  • Can you speak to the average size of data sets that you manage and any approach that you use to optimize for cost of storage and processing capacity?

    • Can you walk through the lifecycle of a given data set?
  • Data archiving is a complicated and difficult endeavor due to issues pertaining to changing data formats and storage media, as well as repeatability of computing environments to generate and/or process them. Can you discuss the technical and procedural approaches that you take to address those challenges?

  • Once the data is stored you expose it for public use via a set of APIs which support linked data. Can you discuss any complexities that arise from needing to identify and expose interrelations between the data sets?

  • What are some of the most interesting uses you have seen of the data that is hosted on Open Context?

  • What have been some of the most interesting/useful/challenging lessons that you have learned while working on Open Context?

  • What are your goals for the future of Open Context?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA