Summary
Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data that they host, the challenges that they face with scaling ETL processes which require domain specific knowledge, and how the information contained in connections that they expose is being used for interesting projects.
Introduction
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Eric Kansa about Open Context, a platform for publishing, managing, and sharing research data
Interview
- Introduction
- How did you get involved in the area of data management?
I did some database and GIS work for my dissertation in archaeology back in the late 1990s. I got frustrated at the lack of comparative data, and at all the work I put into creating data that nobody would likely use, so I decided to focus my energies on research data management.
- Can you start by describing what Open Context is and how it started?
Open Context is an open access data publishing service for archaeology. It started because we need better ways of disseminating structured data and digital media than is possible with conventional articles, books, and reports.
- What are your protocols for determining which data sets you will work with?
Datasets need to come from research projects that meet the normal standards of professional conduct (laws, ethics, professional norms) articulated by archaeology’s professional societies.
- What are some of the challenges unique to research data?
- What are some of the unique requirements for processing, publishing, and archiving research data?
You have to work on a shoestring budget, essentially providing "public goods". Archaeologists typically don’t have much discretionary money available, and publishing and archiving data are not yet very common practices.
Another issue is that it will take a long time to publish enough data to power "meta-analyses" that draw upon many datasets. Lots of archaeological data describe very particular places and times. Because datasets can be so particularistic, finding data relevant to your interests can be hard. So we face a monumental task in supplying enough data to satisfy many, many particularistic interests.
- How much education is necessary around your content licensing for researchers who are interested in publishing their data with you?
We require use of Creative Commons licenses, and strongly encourage the CC-BY license or CC0 (public domain) to keep things simple and easy to understand.
- Can you describe the system architecture that you use for Open Context?
Open Context is a Django (Python) application with a PostgreSQL database and an Apache Solr index. It runs on Google Cloud services on Debian Linux.
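Apache Solr is used for things like faceted search over the published records. As a rough illustration of the kind of query involved, here is a minimal pysolr sketch; the core name and facet field names are placeholders, not Open Context's actual schema.

```python
# A minimal sketch of a faceted Solr query of the kind used for browse/search.
# The host, core name, and field names are placeholders.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/open_context", timeout=10)

results = solr.search(
    "pottery",                     # free-text query
    **{
        "facet": "on",
        "facet.field": "item_type",  # hypothetical facet field
        "rows": 10,
    },
)

print("matches:", results.hits)
print("facets:", results.facets.get("facet_fields", {}))
```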
- What is the process for cleaning and formatting the data that you host?
- How much domain expertise is necessary to ensure proper conversion of the source data?
That’s one of the bottlenecks. We have to do an ETL (extract, transform, load) process on each dataset researchers submit for publication. Each dataset may need lots of cleaning and back-and-forth conversations with the data creators.
- Can you discuss the challenges that you face in maintaining a consistent ontology?
- What pieces of metadata do you track for a given data set?
- Can you speak to the average size of data sets that you manage and any approach that you use to optimize for cost of storage and processing capacity?
- Can you walk through the lifecycle of a given data set?
- Data archiving is a complicated and difficult endeavor due to issues pertaining to changing data formats and storage media, as well as repeatability of computing environments to generate and/or process them. Can you discuss the technical and procedural approaches that you take to address those challenges?
- Once the data is stored you expose it for public use via a set of APIs which support linked data. Can you discuss any complexities that arise from needing to identify and expose interrelations between the data sets?
- What are some of the most interesting uses you have seen of the data that is hosted on Open Context?
- What have been some of the most interesting/useful/challenging lessons that you have learned while working on Open Context?
- What are your goals for the future of Open Context?
Contact Info
- @ekansa on Twitter
- ResearchGate
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Open Context
- Bronze Age
- GIS (Geographic Information System)
- Filemaker
- Access Database
- Excel
- Creative Commons
- Open Context On Github
- Django
- PostgreSQL
- Apache Solr
- GeoJSON
- JSON-LD
- RDF
- OCHRE
- SKOS (Simple Knowledge Organization System)
- Django Reversion
- California Digital Library
- Zenodo
- CERN
- Digital Index of North American Archaeology (DINAA)
- Ansible
- Docker
- OpenRefine
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello. Welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them. So check out Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And don't forget to go to dataengineeringpodcast.com/chat to join the community and keep the conversation going. Your host is Tobias Macey. And today I'm interviewing Eric Kansa about Open Context, a platform for publishing, managing, and sharing research data. So, Eric, could you start by introducing yourself? Hi. Yeah. My name's Eric. I direct the Open Context project, a nonprofit
[00:01:15] Unknown:
publishing service for archaeology, for archaeological research data. It started out in 2006 and has gone through several different iterations over the past several years, and we currently have about 1,500,000 entities that we've published. I can talk more about what that means later. Those come from roughly a thousand different researchers and institutions around the world. And it is really intended to be a way of sharing a lot of the structured data that comes out of the world of archaeological excavations and archaeological surveys. And do you remember how you first got involved in the area of data management? Yeah. Mostly, it was born out of frustration. I was doing my graduate work, my dissertation work, in the late 1990s in archaeology. My background was focused on studying the early Bronze Age and the impact of the formation of the Egyptian state and civilization on its neighboring regions. And that work involved building some databases and some GIS.
And I was really interested in seeing how the data that I was creating and exploring would relate to the data that other researchers had also developed. And it was super frustrating because there was really no access to their data. So we would get these, in some ways frustratingly, very superficial publications where they would summarize a few results from excavations that were related to what I was looking at, but I was never able to see the structured databases. What were the actual counts of the different things that they were finding, and how did that relate to what I had? So that was one of the real reasons why I wanted to get into this area: there was a real niche there, a strong need to be able to compare these different datasets and try to see the bigger picture. And decades later, we're still working at it, and it turned out to be a much harder problem than I initially, naively thought it would be. And so along that journey, you ended up
[00:03:27] Unknown:
working on building the Open Context platform and organization. So can you describe a bit about what it is that you're doing at Open Context and the mission that you're trying to drive towards and how the overall project got started?
[00:03:42] Unknown:
Yeah. So we recognized early on that the conventional modes of publishing in the research world, in academia, were just not sufficient for managing structured data. So conventional articles, books, and reports, those are the sort of bread and butter of scholarly communication between researchers, and that's the world of publish or perish. Right? So people are publishing at a feverish pace, but those publications are very difficult to use for any sort of quantified analysis, difficult to aggregate, and there are all sorts of difficult issues with actually reusing the information that they're presenting in these conventional publications. And the other issue with archaeology is that it often relies on destructive methodologies. So when you dig a site, you're actually destroying that site.
And in the process of excavation, you have a really strong professional and ethical imperative to do very detailed recording of what it is that you're encountering when you do that excavation. You have to know where you're finding everything. You have to know the stratigraphic relationships between different deposits. You have to know where different architectural features are. There's a huge amount of recording that has to take place, and that recording is actually quite complicated. And it's typically done using databases of one form or another to actually record this excavation process.
And unless we come up with ways of keeping that information, keeping those records in those databases, then, because that process of excavation destroys sites, all the information that comes out of it will be lost unless we do something to preserve, archive, and share these digital data. So in addition to opening up new research opportunities, which is one of the things that got me excited about this issue of managing data, we have this really important ethical imperative: this is the way that we're going to pass an archaeological record down to future generations, if we are successful in the management of this research data. So those are the driving forces behind this, and Open Context is our attempt to provide practical, real-world services in order to meet those larger needs of dissemination, opening new research opportunities, and also putting these data into formats, into a larger public context, and into digital repositories where they can be preserved. And so the datasets that you're
[00:06:23] Unknown:
managing, are they solely dedicated to archaeological research, or do you have other scientific domains represented as well? The vast majority are archaeological.
[00:06:33] Unknown:
We have one test dataset in public health, but for the most part, we're busy enough with the archaeologists that we're really focused on that need. And the other issue is that our publishing model, the way that we curate the data, really requires some domain expertise. So just because of who we are in terms of our own background and staffing, we do focus on archaeology, not other outside domains where there could be very different kinds of data modeling and metadata and ontology concerns that are beyond our expertise and comfort level. And in terms of determining which datasets
[00:07:14] Unknown:
you are willing and able to work with, do you have a particular set of guidelines or protocols for when you're first interacting with somebody who comes to you with a particular, set of records that they want to publish on your platform?
[00:07:30] Unknown:
Yeah. Absolutely. It's important for us that the research that we publish is coming from projects that are meeting the normal standards of professional conduct in the discipline of archaeology. The professional societies in the field have ethics codes and professional norms, and there are also laws that govern how archaeology is conducted in different jurisdictions around the world. And so all of those different kinds of standards and norms have to be met for us to engage with a researcher to publish a dataset. This is important because those professional conduct frameworks help establish that we are working for the archaeological research community, and we don't wanna provide a platform for people who are doing treasure hunting. There's a world out there of people who are doing things like illicit metal detecting and treasure hunting, and if we were to publish that kind of information, it might endanger sites. They might get vandalized, they might get looted, and we wanna make sure that we're not facilitating the destruction of cultural heritage.
So working with a professional community of people who agree to a common set of ethics, that's something that is,
[00:08:52] Unknown:
an important aspect of, who we work with. And in terms of the actual data that you receive from these different research projects and archaeological excavations, what sorts of data formats are you dealing with and some of the unique challenges that come along with the nature of the data that you're dealing with in terms of how it's obtained, how it's recorded, and how it's structured?
[00:09:17] Unknown:
So most archaeologists are not necessarily experts in databases, and the domain typically involves a lot of use of pretty normal office suite kinds of products. The more sophisticated archaeologists are using relational databases, databases like FileMaker or Access, especially. A lot of people use Excel for recording structured data, and there's a lot of variability in the consistency of the data that people record. So some people do have different kinds of protocols in place for data validation so that the datasets are more consistent, and a lot of people don't. And so there's this issue that some data need a lot of work and after-the-fact cleanup, and that can be pretty labor intensive.
One of the other issues is that archaeologists typically collaborate when they build their datasets. So a single excavation may have different individuals who are documenting, describing archaeological contexts, which are different kinds of deposits of dirt that they dig, and they would have different specialists who would be studying the different classes of materials. So typically there'd be, say, animal bones that are recovered from an archaeological site, and those are described by a zooarchaeologist, somebody who has training in zoology and anatomy.
There could be seeds, charred seed remains, that are being described by a botanist. And there could be different experts who will be studying different aspects of material culture. So pottery, coins, stone tools, metal implements. Some other people might be studying different kinds of artwork or sculptures or all sorts of different kinds of materials, sometimes even textiles. So there's a huge number of people with very different kinds of expertise, and they typically create their own datasets. And the main way that these different datasets can be related to one another is through a shared context: where is it that an object or a bone or a seed was found, which archaeological context was the source of that material. And one of the interesting and challenging bits is that because different researchers are basically managing their own datasets individually, bringing them all together can actually be a lot harder than you think it should be. Because, you know, somebody might write in their Excel spreadsheet that a certain bone comes from a locus, which is a common term for an archaeological deposit, say locus 10, and they write "L.10", and then somebody else in a different database is looking at stone tools from that same locus.
They might write "locus 10" or just "10". So there's this issue of inconsistent identifiers that are used to reference archaeological contexts. And that can actually be a very big headache when you try to relate these different kinds of materials together for just one archaeological site. And so this is why there needs to be a lot of investment in trying to go through and understand the different ways that some of these identifiers are expressed and reconcile them so that you can actually bring the materials together in the way that they should be.
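To make the reconciliation problem concrete, here is a minimal, hypothetical Python sketch of normalizing variant locus identifiers before joining specialists' records. The patterns are invented for illustration and are not Open Context's actual ETL code.

```python
import re
from typing import Optional

# Map variant context identifiers ("L.10", "Locus 10", "l-10", "10") onto one
# canonical form so records from different specialists can be joined.
# This pattern is illustrative only.
LOCUS_PATTERN = re.compile(r"^(?:l(?:ocus)?)?[\s.\-]*0*(\d+)$", re.IGNORECASE)

def canonical_locus(raw: str) -> Optional[str]:
    """Return 'Locus 10' for any recognized variant of locus 10, else None."""
    match = LOCUS_PATTERN.match(raw.strip())
    if not match:
        return None  # flag for manual review and a conversation with the data creator
    return f"Locus {int(match.group(1))}"

# Example: three specialists, three spellings, one archaeological context.
for raw in ["L.10", "locus 10", "10", "l-10"]:
    print(raw, "->", canonical_locus(raw))
```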
[00:12:41] Unknown:
And in the process of onboarding these different datasets from various research activities or dig sites, As you were saying, there's a lot of domain expertise that's required to be able to make an effective use of that source data and convert it into the schema that you've established for OpenContacts. So can you talk through some of your strategies and tactics for ensuring that you are able to make an appropriate set of transformations for these different datasets and try to extract patterns to make the process more repeatable and also any issues that you encounter during that activity as far as mitigating data loss because of those conversion efforts?
[00:13:29] Unknown:
Yeah. I mean, on the aspect of data loss, Open Context has a very, very abstract, generalized global schema. There's not much more to it than sort of a triple store or a key-value pair type of thing describing every entity. And we have some common metadata requirements and some common rules of inference around all that that help make things a bit more discoverable. But, really, a researcher who submits their dataset to us can describe those data with any sorts of attributes that they want. They often have their own controlled vocabulary. Sometimes they're gonna be referencing a shared controlled vocabulary that might be professionally curated. So, especially the people who work with animal bones, pretty much they're classifying the species, the biological taxon of the bone, in very similar kinds of ways, and so that becomes an easier issue in terms of linking across different classification systems.
But for the most part, people come up with their own idiosyncratic ways of describing stuff. And Open Context's main assumption is that stuff is related via contextual relationships, that a certain record will have relationships to other records that we can describe. So we're very flexible, I guess, in terms of the different kinds of schemas that we accept, especially with descriptive attributes. The hard thing really centers on making sure that we're understanding the identifiers correctly in the source dataset.
So context identifiers are the thing that we focus a lot of attention on. And the other thing is just, what is it that is being described? That's one of the things that we care a lot about. So sometimes, depending on the way that somebody structures the dataset, say they're just using a flat table like an Excel spreadsheet, multiple records are describing the same thing. It's just that they wanna add multiple attributes to that one thing. So a coin might have multiple motifs on it, and they would have multiple rows in the spreadsheet to record that there are multiple motifs on a certain coin or a certain potsherd or whatever.
And we need to understand that, okay, this is the same coin, it just has multiple attributes. It's not a mistake that this thing is repeated over and over again. So those are the sorts of questions that we have to look at when we map the schemas from a source dataset and go through our extract, transform, load process and move things into Open Context. Essentially, we have to understand what is being described, whether things are being described by attributes that can take on multiple values or not, and then what the relationships are between the things that are being described.
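A small sketch of the multi-row case described above, with invented column names: repeated spreadsheet rows for the same object are collapsed into one entity whose attributes can hold multiple values.

```python
from collections import defaultdict

# Flat rows exported from a spreadsheet: the same coin appears in several rows
# because it carries two motifs. Column names here are invented for the example.
rows = [
    {"object_id": "coin-001", "attribute": "motif", "value": "eagle"},
    {"object_id": "coin-001", "attribute": "motif", "value": "laurel wreath"},
    {"object_id": "coin-001", "attribute": "material", "value": "bronze"},
]

# Collapse rows into one record per entity, allowing attributes to take multiple values.
entities = defaultdict(lambda: defaultdict(list))
for row in rows:
    entities[row["object_id"]][row["attribute"]].append(row["value"])

# coin-001 is one coin with two motifs, not a duplicated record.
print(dict(entities["coin-001"]))
# {'motif': ['eagle', 'laurel wreath'], 'material': ['bronze']}
```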
[00:16:30] Unknown:
And for the data that ends up in the open context system, you encourage the use of Creative Commons licensing. So I'm wondering if you've had any issues in terms of needing to educate researchers as far as the implications of that licensing or if you've had any pushback from people who initially approach you once you mention those licensing requirements and just your overall considerations of using Creative Commons as the license category for the data that you are hosting on your platform.
[00:17:05] Unknown:
Yeah. The whole issue of intellectual property in this space is a huge area of research, and it touches on a bunch of different considerations. So there's a practical interoperability consideration: if we all use common standard licenses, then the content that we have would be legally interoperable, which is nice, and that's one of the wonderful things about Creative Commons. They're standard licenses, you can have interoperability, and it's all expressed with standard metadata too, so that in a machine-readable way you can know what the copyright status is of some content. So that's great. But one of the big complications is that it's not just interoperability that we need to optimize, especially in a field like archaeology. In archaeology, we're dealing with lots of different communities around the world with different sets of values and different assumptions about what the archaeological past means to them, who owns that archaeological past, who can speak for it. And those kinds of concerns mean that the ethical landscape around using Creative Commons licenses is a lot more complicated. And so we actually say in Open Context that open access and open licensing are wonderful and powerful tools, but they're not universally appropriate. There are definitely going to be circumstances where Open Context is not a good platform for sharing data. If the data have specific kinds of sensitivities, especially, say, if there are indigenous people who regard this information as important for their own heritage, particularly in situations where there's maybe a history of colonialism, then you have to be very careful about license choices and whether or not a platform like Open Context is an appropriate way of disseminating the data. You might have to find some more restrictive mechanisms in order to take a more judicious and situationally aware approach to all that. So it's an interesting kind of an issue. One of the other issues about licenses like Creative Commons is that they're copyright licenses. And in the United States, there's this distinction between facts and expressions. Factual data, things like measurements or the height of Mount Everest, is typically not seen as something that copyright actually covers. So there's an ambiguity about what aspects of a dataset are expressive, and so the domain of copyright, and what aspects of an archaeological dataset are more factual, where copyright probably doesn't apply, no matter what license you put on it. So those are the kinds of issues that we have to walk through with the researcher community and also the professional community, because we wanna make sure that these tools, license interoperability, open data, lead to good outcomes and not harmful outcomes. And so that's why we want people to be thoughtful about how they're applying these kinds of tools. And further along the topic
[00:20:32] Unknown:
of expressiveness versus just factual information is any research articles that might either accompany or reference the data that is stored in Open Context or that's being submitted to Open Context. So I'm wondering, in terms of the metadata that you track for a given dataset, if there is any reference to or content from research articles that might be associated with those datasets, and, as far as any industry journals or publications for archaeology, any relationship that you might have with them, whether it's positive or ambivalent?
[00:21:15] Unknown:
Yeah. Actually, it's mostly been very positive in terms of the conventional publishers that we work with. And there's a couple different issues. One is that citation is the main kind of currency in academic research and promotion, tenure, that type of thing. So we definitely want people to cite datasets, because people put a lot of work into creating those datasets, and then they put a lot of work into annotating and describing them. And we put a lot of work into working with researchers and cleaning them up and hopefully making them more usable by publishing them with Open Context. So that effort, we want to recognize and reward, because it helps motivate people to do the right thing in sharing data.
And if people cite published datasets, then that should start a nice positive feedback loop of people getting that kind of recognition, and the ideal scenario is that researchers will do well by doing right, that by sharing data they will advance their own careers. So in that sense, we participate with a variety of mechanisms in order to try to make citation easier and more meaningful. Open Context uses a service called EZID, which is hosted by the California Digital Library. The California Digital Library is the main institutional digital service, the institutional repository, for the University of California system. So all the different campuses of the UC system are involved with it, and EZID is a service to mint persistent identifiers. In our case, the persistent identifiers that we mint with EZID are DOIs for large aggregations of data. In Open Context, those are called projects. We also use DOIs to identify tables, another large aggregation of data, which would be a CSV dump of maybe thousands of records that we would express.
And then we also use something called ARKs, Archival Resource Keys, which are another persistent identifier. We use those for more granular kinds of content, in order to facilitate citation of something maybe very specific. And this is one of the advantages of Open Context versus a more conventional data repository: we can make very specific entities of interest to an archaeologist directly citable. So an example could be a coin. Somebody might discover a coin. It might be minted maybe in Rome and discovered clear across the empire all the way over in Turkey, and it would be kind of interesting to be able to trace that it really traveled far in antiquity.
And you might wanna reference that specific object, that coin. If we only published or curated datasets as they came to us, in big tables of spreadsheets and big relational databases, that coin would not be something that you could actually cite, because it would be a record in a giant data table, or it might be information scattered across several different data tables. Right? So one Excel spreadsheet might describe the coin as described by the numismatist. Another relational database might describe the coin as it was recovered by somebody creating an object inventory. And another relational database might describe the contextual relationship, the context where that coin was actually discovered.
And when we publish things with Open Context, we're bringing all of that information from all those different source data files together to create a more cohesive picture of what the excavation results are. That means we're pulling the information about that one entity, that one coin, out of several different sources and actually making it very convenient to cite and discuss in publications. And that's one of the reasons why a lot of researchers find our approach interesting: they wanna be able to talk about a coin. They don't necessarily only wanna cite a spreadsheet, right, that could have thousands of other things in it too.
So that's one of the key aspects of all of this. It's not just fitting into that academic cycle of rewards, where citation and attribution are how you advance your career; citation is actually a really important part of how people make sense of things. You want to be able to talk about a specific object or context or cite some other entity of interest. And in order to talk about it, you have to be able to point to it. You have to reference it. And citation plays an important role in that as well.
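For readers curious about the mechanics, a rough sketch of minting an ARK through EZID's HTTP API is below. It follows EZID's documented mint operation (POST to a shoulder with ANVL-formatted metadata over HTTP Basic auth), but the shoulder, credentials, target URL, and metadata values are placeholders, and Open Context's real workflow may differ.

```python
# Rough sketch of minting a persistent identifier (an ARK) via the EZID HTTP API.
# Shoulder, credentials, and metadata are placeholders; consult the EZID API docs
# and your repository agreement before using anything like this for real.
import requests

EZID_MINT_URL = "https://ezid.cdlib.org/shoulder/ark:/99999/fk4"  # public test shoulder

def to_anvl(metadata: dict) -> str:
    """Serialize metadata into the ANVL 'key: value' format EZID expects."""
    return "\n".join(f"{key}: {value}" for key, value in metadata.items())

metadata = {
    "_target": "https://opencontext.org/subjects/EXAMPLE-UUID",  # placeholder landing page
    "erc.who": "Example Excavation Project",
    "erc.what": "Bronze coin, object record",
    "erc.when": "2019",
}

response = requests.post(
    EZID_MINT_URL,
    data=to_anvl(metadata).encode("utf-8"),
    headers={"Content-Type": "text/plain; charset=UTF-8"},
    auth=("apitest", "apitest"),  # placeholder credentials
)
response.raise_for_status()
print(response.text)  # e.g. "success: ark:/99999/fk4..."
```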
[00:26:23] Unknown:
And in terms of the technical architecture that underpins the open context platform and the ability to manage these entity aggregations and expose the data for these direct citations and for being able to discover interrelations between the data. Can you just discuss the overall technical platform that you've built and, some of the challenges and architectural evolutions that have occurred along the way?
[00:26:55] Unknown:
Well, first of all, the entire stack, everything that we use, is fully open source. The current iteration of Open Context is a Python 3 application using the Django framework, and we have a Postgres relational database as our primary data store in the back end, and we use an Apache Solr index. That's basically a big document NoSQL database for things like faceted search. Not a lot of Open Context is really meant for data analysis or visualization itself. What we're mainly trying to do is make the data that we curate browsable and discoverable.
You can actually look at it, and hopefully it's somewhat aesthetically pleasing, at least. That also matters because it's publishing, and the aesthetic element is an issue. And then once it's browsable and discoverable and all that, we wanna make it relatively easy to grab the data that you want, and then you can use your own tools that you're comfortable with to do your own analysis and visualization. So mostly, for the community of people that we serve, that would be like browsing around, seeing, here's a set of pottery from this one archaeological site, and then you narrow it down. Okay, I'm just selecting the certain time period of my interest, and then you can export a CSV table of that and walk away and play with it in Excel, something like that. More sophisticated users can use our API to do more sophisticated kinds of things. The API provides most all of the data in GeoJSON, which is a very popular geospatial data format. And you can also usually interpret the data that we have as JSON-LD and then have that converted into a graph of RDF triples.
So there are different options for using it, but, basically, we're mainly aiming to make the data easier to discover so that you can grab it and use it in the tools that you're interested in. When we initially started, we were working in PHP and had a MySQL back end. And I guess I started working in Python probably 2013. That's when we made the switch, 2013, 2014.
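For readers who want to try the API route described here, the sketch below fetches a single published record as JSON and loads it into an RDF graph with rdflib, which bundles a JSON-LD parser in version 6 and later. The URL pattern and UUID are placeholders; consult the Open Context API documentation for the real endpoints.

```python
# Sketch of pulling one published record as JSON-LD / GeoJSON and loading it as RDF.
# The record URL is illustrative; check the Open Context API docs for real endpoints.
# Requires: requests, rdflib>=6 (which includes the JSON-LD parser).
import requests
from rdflib import Graph

record_url = "https://opencontext.org/subjects/EXAMPLE-UUID.json"  # placeholder UUID
payload = requests.get(record_url, headers={"Accept": "application/json"})
payload.raise_for_status()

# The same document can serve as GeoJSON (geometry/features) and JSON-LD (linked data),
# so it can be dropped into a GIS tool or parsed into a triple graph.
graph = Graph()
graph.parse(data=payload.text, format="json-ld")

print(f"{len(graph)} triples")
for subject, predicate, obj in list(graph)[:5]:
    print(subject, predicate, obj)
```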
[00:29:22] Unknown:
And in the process of that switch going from MySQL to Postgres and PHP to Django, was there a high level of difficulty in terms of being able to translate the data that you had present on the platform at the time to the new system, or was it a fairly clean mapping where because it was simply just different relational databases, you were able to largely do an export and restore just as a set of SQL statements? Yeah. Actually,
[00:29:50] Unknown:
weirdly, we went through XML as an intermediary. So the PHP MySQL version of Open Context, instead of having mainly a GeoJSON structured data format that you could use as a view for all the different items, it was an XML structured data format. And this goes back to some of the intellectual pedigree of Open Context. There's a system at the University of Chicago called OCHRE, the Online Cultural Heritage Research Environment, and the initial incarnation of OCHRE was, and still is, a native XML database, and they had a schema that they publicized called Archaeological Markup Language, ArchaeoML. We were essentially using ArchaeoML as our main kind of schema in our own back end, a simplified set of the ArchaeoML schema, actually.
And so, when we did the migration from the PHP MySQL system into our current iteration, we used those XML documents as the source data.
[00:30:56] Unknown:
And as far as the ontology that you maintain and the taxonomical details of the datasets, is there a lot of difficulty as far as being able to map the different entities to those sets of attributes in order to make them more easily linkable so that you can traverse any relations between them. And then in terms of the metadata, I'm curious if you track the dig or the specific research project that the different entities are derived from so that you can have sort of a meta level of linkage. So where you track the different digs in relation to each other, and then from there, dig into the separate entities that are discovered?
[00:31:41] Unknown:
Yeah. So every single project, as I mentioned, has its own descriptive attributes, and even within a project there'd be a lot of diversity. Different researchers who are working on different classes of materials, like the pottery specialists, the bone specialists, the stone tool specialists, would come up with their own set of descriptive attributes to describe the materials that they're looking at, and they could have their own controlled vocabularies, which in archaeology are usually called typologies, sets of measurements that they're interested in, etcetera, etcetera. And that is, unfortunately, the state of the discipline: there's not a huge amount of consensus between researchers describing similar kinds of stuff.
So that diversity is one of the reasons why we have the architecture that we have, in that we have to publish the ways in which archaeologists are describing their materials. We have to publish all of these different kinds of attributes as data also, and that's part of the intellectual contribution that these researchers are creating, because they're usually encountering stuff that very few other people have encountered, and they are making it up as they go along. This is an active area of research where people are trying to find new and better ways of describing archaeological deposits or archaeological material culture, that sort of thing. So they have a lot of intellectual investment, basically, in creating and defining their own descriptive attributes.
So when we publish datasets, we're also publishing a custom set of attributes associated with those datasets. And in order to achieve some interoperability where we can, we use SKOS, the Simple Knowledge Organization System, to say that this one term in this one researcher's dataset, let's say a controlled vocabulary term, might be a close match to this concept in the Getty Art and Architecture Thesaurus. And that would be a common standard that would be widely applicable in cultural heritage. So a good example might be a type of Etruscan pottery called bucchero. Bucchero is this really pretty, dark gray, shiny, heavily burnished kind of pottery.
And it might be described in a couple different ways in different researchers' datasets, but there are some museums, and the Getty Art and Architecture Thesaurus, that have their own controlled vocabulary that describes that. The Getty Art and Architecture Thesaurus is a nice, openly licensed, linked data kind of thesaurus, and then we can say this term bucchero is a close match for how the Getty defines bucchero. And then we can essentially make those common linkages across these idiosyncratic kinds of descriptions to point to some common standards. And so that is more or less the level of semantic interoperability, I guess, that you get in Open Context: we basically start drawing relationships between different vocabulary terms. So it's not that sophisticated.
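To picture what that kind of linking looks like as data, here is a small rdflib sketch asserting a skos:closeMatch between a project's own "bucchero" term and a Getty AAT concept. Both URIs, including the AAT numeric ID, are placeholders rather than the real identifiers.

```python
# Sketch of asserting a SKOS closeMatch between a project's own vocabulary term and
# a Getty AAT concept. The URIs are placeholders (the AAT numeric ID here is made up).
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import SKOS

project_term = URIRef("https://opencontext.org/predicates/EXAMPLE-ware-bucchero")
aat_concept = URIRef("http://vocab.getty.edu/aat/300000000")  # placeholder AAT ID

graph = Graph()
graph.bind("skos", SKOS)
graph.add((project_term, SKOS.prefLabel, Literal("Bucchero", lang="en")))
graph.add((project_term, SKOS.closeMatch, aat_concept))

print(graph.serialize(format="turtle"))
```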
We're not in the business of supporting inferences on OWL ontologies or anything like that. That would be an awesome thing to do more of in the future, especially if we get to collaborate with some people with those sorts of interests and expertise. But what we do get is the ability to combine different datasets, and we've been most successful with zooarchaeology, with animal bones, where we've basically seen that kind of technique work. You can actually do searches now and discover all the cattle bones, all the tibiae of cattle, from dozens of different archaeological projects. And that is a pretty big accomplishment in a field like archaeology where everybody's doing their own thing. And once the data is written into
[00:35:57] Unknown:
open context, is it largely static at that point, or do you have people who will periodically go back through and add additional information or metadata or linkages after the fact? And is that something that users of the platform are able to contribute as suggestions or enhancements to different datasets?
[00:36:19] Unknown:
That would be really interesting to be able to support. No, we haven't built out that sort of functionality of user-contributed semantic enhancement, or even just, hey, there's a bug here, an error. One of the things that we've been experimenting with, and are gonna be moving ahead with in the near term, is version control of data. For the most part, most of the data that we publish with Open Context, once it's published, is pretty static. The main changes are usually going to be additions where, say, we might publish a set of contexts first, and then the pottery person is ready with their pottery data and we add pottery to that context dataset, etcetera, etcetera. So most of it's additions, but sometimes there are gonna be error corrections and that type of thing.
And because we generate all these JSON-LD documents, we have actually used Git as a version control mechanism for all of that. We're doing more of that internally because it was starting to get unwieldy; we're just building these giant, bloated Git repositories internally, and we're trying to figure out a good way of expressing publicly what those changes are. So that's actually an area of active development, but we do have a sort of an audit trail of how things are changing internally.
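As a generic illustration of that idea (not Open Context's internal versioning code), the sketch below writes a record's JSON-LD document into a local Git repository and commits each change, so corrections show up as an inspectable audit trail in the Git history.

```python
# Sketch of keeping an audit trail for published JSON-LD documents with plain Git.
# Generic illustration only; assumes `git init data-audit` has already been run.
import json
import subprocess
from pathlib import Path

REPO_DIR = Path("data-audit")          # a local Git repository (placeholder path)
RECORDS_DIR = REPO_DIR / "records"

def commit_record(record_id: str, document: dict, message: str) -> None:
    """Write a record's JSON-LD document and commit the change."""
    RECORDS_DIR.mkdir(parents=True, exist_ok=True)
    path = RECORDS_DIR / f"{record_id}.jsonld"
    path.write_text(json.dumps(document, indent=2, sort_keys=True), encoding="utf-8")
    relative = str(path.relative_to(REPO_DIR))
    subprocess.run(["git", "-C", str(REPO_DIR), "add", relative], check=True)
    subprocess.run(["git", "-C", str(REPO_DIR), "commit", "-m", message], check=True)

# Example: record an error correction so the change is visible in `git log`.
commit_record("coin-001", {"label": "Bronze coin", "motif": ["eagle"]}, "Correct motif list")
```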
[00:37:47] Unknown:
And that brings me into a couple of my next questions. 1 of which is the average size of the typical dataset that you would be processing, just in terms of relative volumes? Are they generally in the megabyte, gigabyte, terabyte scale? And then I know that part of your overall endeavor is also to serve as an archive repository for a lot of this information. So I'd be interested in exploring some of the complications that arise from trying to maintain a long term record of this information, particularly as storage formats
[00:38:28] Unknown:
and physical medium changes over time? Well, I'll just start with the archival issue first. We're not a preservation repository, because that involves a whole host of institutional requirements and workflows and processes that we just don't have the capability or the scale to achieve ourselves. But what we do do is, when we publish data, we put the data that we publish into preservation repositories that are managed by other institutions. So, historically, the main institution that we've worked with, and continue to work with, is the California Digital Library, which is, again, the main repository of the University of California system.
We're also using Zenodo, which is another repository; it's at CERN, the particle physics research lab in Switzerland. And the rationale for using Zenodo is that they have a very nice, convenient API, and they support a lot of metadata using the linked data entities that we find useful. And there's already a lot of convergence on using them by other archaeologists, so the material is there with a good community of people already engaged with Zenodo. So for long-term data preservation, we're not gonna be the end place for it. But what we can do on our end is try to make that data preservation more feasible, more likely to actually work out, in the sense that, by publishing data with Open Context, we're extracting data from a bunch of different source datasets with very different kinds of file formats and whatnot and expressing it as structured data, as JSON and JSON-LD for the most part, and CSV, these nice, simple text formats that lend themselves to preservation.
And then the other thing is that we often get a lot of media associated with the structured data that we publish. So, talking about coins or pots or whatever, one of the nice things about having that level of granularity and citation is that we can also do things like relate specific pictures, or even 3D models, to specific objects or contexts, and we do that also. And when we publish the material in Open Context, we're also depositing a lot of media files, binary media files of one form or another, into digital repositories. So those are the ways in which we work in the archival sense, providing standard metadata and everything that hopefully makes the job of the digital archives easier in order to maintain the data. Right. So, I mean, it varies all over the place. Some datasets are quite small, say a megabyte spreadsheet or something like that, and then they go up from there.
What really bloats or expands the size greatly is if somebody has a lot of images or digital media. 3D files take up a lot of space, obviously. But in certain circumstances, for example, we recently published a set of archival materials describing the Great Sphinx at Giza in Egypt, and even one scanned image could be a gigabyte itself, and we would have something like 30,000 of them. So there are sometimes some large space and storage concerns. And, again, we really couldn't do what we do without that fundamental infrastructure, the services being provided by the digital repositories, the digital libraries, hosting all these things over the long term, because hosting all these things on our own actually starts getting expensive too, with all those files.
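For the Zenodo side of that workflow, a rough sketch of a deposit through Zenodo's REST API is shown below. It follows the documented create, upload, and describe steps, but the token, file, and metadata are placeholders, and this is not Open Context's production archiving code.

```python
# Rough sketch of depositing a published dataset (a CSV plus metadata) into Zenodo
# via its REST API. Token, file, and metadata values are placeholders.
import requests

ZENODO_API = "https://zenodo.org/api/deposit/depositions"
TOKEN = "YOUR-ZENODO-TOKEN"  # placeholder

# 1. Create an empty deposition.
deposition = requests.post(ZENODO_API, params={"access_token": TOKEN}, json={}).json()

# 2. Upload a file into the deposition's bucket.
bucket_url = deposition["links"]["bucket"]
with open("example_dataset.csv", "rb") as handle:
    requests.put(f"{bucket_url}/example_dataset.csv",
                 data=handle, params={"access_token": TOKEN})

# 3. Attach descriptive metadata (kept deliberately simple here).
metadata = {"metadata": {
    "title": "Example archaeological dataset",
    "upload_type": "dataset",
    "description": "CSV export of published excavation records (placeholder).",
    "creators": [{"name": "Example, Researcher"}],
}}
requests.put(f"{ZENODO_API}/{deposition['id']}",
             params={"access_token": TOKEN}, json=metadata)

# 4. Publishing is a separate, deliberate step:
# requests.post(f"{ZENODO_API}/{deposition['id']}/actions/publish",
#               params={"access_token": TOKEN})
```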
[00:42:23] Unknown:
And as you mentioned, you don't do any in house, complicated analyses or visualizations of the data that you are hosting. So I'm wondering what some of the most interesting or unexpected uses
[00:42:40] Unknown:
of your data you have seen out in the wild? Yeah. Probably the most significant use of data for us has been the Digital Index of North American Archaeology; DINAA is the acronym. DINAA is an aggregation of data that's a little bit different for us, in that most of the data in Open Context come to us from academic researchers, while the datasets coming to us in the DINAA project are coming from state government officials. In the United States, there are historical preservation laws at the federal level, and the federal government has delegated the enforcement of those laws to different state governments. And so each state has officials in charge of managing cultural properties, historical resources, within the state borders.
And each state develops information systems, databases basically, to help in that administration. And you can imagine there are about 50 different schemas, datasets, 50 different ways that states have tried to figure out how to manage historical and archaeological sites. And what's interesting about that is that the state governments have the best window onto what is known about the past, in terms of the settlement history of the North American continent since the Pleistocene, because the states have aggregated all of this data that is required by law to be created. And the neat thing about DINAA is that, by bringing these datasets together from several different states, for the first time researchers are able to see a much bigger picture of what human settlement is like across the North American continent over very deep, long periods of time. So you can see what things looked like 10,000 years ago, in terms of what's known about where there are archaeological sites, and compare that to what the situation was like 500 years ago, at the beginning of European colonization.
So that is a really powerful kind of a resource. There are all sorts of really interesting issues with it; for example, these datasets are also not created with any sort of systematic sampling at all. There are all sorts of biases in them, and that's something that I think is a really fascinating area to explore, and it's gonna be a huge challenge for interpretation. But nevertheless, this dataset is about as big as it gets in archaeology for looking at big-picture kinds of questions. And the big research outcome actually hasn't focused so much on where people were living in the past; the big research outcome we've had with the DINAA dataset has been modeling what's gonna happen to these sites in the future.
And most of the coverage of DINAA focuses on the American Midwest and the Southeast, but we have a lot of the Gulf Coast and the Atlantic Seaboard in the DINAA dataset. And this paper that was published in PLOS ONE described what the impact would be with, say, a one meter, two meter, or three meter rise in sea level driven by climate change. And just with a very modest one meter rise in sea level, which a lot of projections are saying is gonna be happening within the next few decades, we're gonna lose 13,000 known archaeological sites. And that had a huge amount of press coverage and media impact, because a lot of people didn't even know that there were 13,000 archaeological sites in North America, let alone just along the coast, that we'd be losing to sea level rise.
So that one made a big impact, pardon the pun. It made a big splash, and it was recently cited in the 2018 National Climate Assessment, which was a major US government report about the challenges of climate change. So that's probably one of the key research outcomes that we've had with Open Context, and then there have been several smaller but still interesting and significant ones. Again, with the zooarchaeological data, people have been able to see patterns in how animal domestication, which happened first in the Near East, in places like Iraq and Syria and southeast Turkey, spread towards Europe along with the economies around animal husbandry. That's something that was facilitated with Open Context, and there's been publication from that. And then the data that we published originally with that has been reused in other publications, which is also interesting. So there's actually reuse of this published data for more than one research purpose. And then there are other applications which are kind of more fun: people using the API to do things like pull in the data for virtual reality or augmented reality kinds of applications.
Those are smaller scale, but what's fun and useful about that is that we really need to build up the technical skills in archaeology: people who have the data skills to engage with a system like Open Context and develop an understanding of how to use a web API and what you can do with it. And so those are the types of things that are maybe, in the longer term, really important, because that starts getting people more engaged with this whole idea of why shared data is important, and what you can do with APIs, whether it's the Open Context API or, oh look, there's data from a completely different source, maybe you combine it with GeoNames or with Wikidata or something else. And then that starts to enable some really interesting kinds of applications.
[00:48:42] Unknown:
And in the process of building and maintaining the Open Context platform and working with researchers and archaeologists, what have been some of the most interesting or useful or challenging lessons that you've learned? One of the biggest things is that it is
[00:49:00] Unknown:
incredibly slow to change a disciplinary culture. Most academic researchers are in the world of publish or perish, and so they have a lot of professional pressure to make publications that they're sure their peers will recognize as something valuable. And that sometimes leads to some risk aversion: publishing data is something that's relatively new, and it's not a tried and true path to tenure. And so that's one of the reasons why adoption has not been as fast as we would want. Fortunately, that situation is changing now that we're starting to see some interesting reuses of data, and people are starting to cite it, and it's appearing in bibliographies of research papers and whatnot.
So things are definitely getting better, but it has taken more than 10 years to get to this point, and you would think that it would have been more of a no-brainer. Other issues that are also really fascinating: there are a bunch of things that get people excited. Some researchers really like seeing their data online. They want to show it off. It's a great way to have sort of an exhibition. Museums and other institutions have beautiful websites that show off their collections, and researchers are happy to use a service like Open Context sometimes so they can participate in that, show off what it is that they've discovered, and also show off, in a lot of ways, that, look, I do really awesome and rigorous recording, look at all this wonderful data that I've got. So that's a positive incentive, and that's something that I think has been underappreciated in the world of research data management. A lot of times there's a discussion of carrots and sticks about this, and there are efforts to make sharing data mandatory, that if you want a grant, you have to share your data. And in a lot of ways, I'm very sympathetic to that. I think that's definitely the right thing to do, but we also need the positive incentives. So citation is obviously a positive incentive, and people coming up with good research outcomes from shared data, that's a good positive incentive. But the other side is just that we want people to feel that publishing and sharing data is something that's recognized and rewarded.
It helps make their data look more attractive. It's more usable. Their own dataset is more usable to them. There are a lot of positive incentives that we try to emphasize with Open Context, because we wanna get past the notion that data sharing is just a minimal checkbox compliance thing, where I threw a couple ugly spreadsheets into a repository and I'm done. Right? That's not very meaningful. The data are probably not that useful if they're just chucked, left as is, in Excel with different kinds of color codes and random typos and all sorts of problems in them. It's a lot better if people put in that intellectual investment in actually trying to understand how to make their data usable and intelligible by a wider community. And that's where I think there needs to be much more emphasis, and that's why this field is actually really interesting. It's hard to communicate meaning with structured data, especially in a field like this where we have practitioners who are working on very particular kinds of things. There's not a tradition of this already established, and we're trying to build that tradition kind of from scratch. And
[00:52:39] Unknown:
what do you have in mind as far as future improvements or future additions or overall goals
[00:52:47] Unknown:
for the future of Open Context? Well, one of the big things right now is that we're really excited because we're working with another developer who's deep in the guts of the system right now, Raymond Yee. And Raymond has a great background in research computing. He's worked with Unglue.it, he's done some really wonderful work, and he's a fantastic Python developer. And so one of the first things that we're working on is making it easier to deploy Open Context, so that it's much more of a one-line, command-line type of a thing where you can just set the whole thing up, using technologies like Ansible and Docker, so that you can get an instance going, maybe with some data, and work with it on your own, play with it, and hopefully that'll encourage more experimentation and maybe even more people contributing to the open source project, which would be cool. So there's that. Then we have a lot of needs in terms of scaling; we're bumping up into some scaling issues. Things are starting to slow down as our index has grown, and so sharding it is gonna be an important next step. There are lots of optimization things to do on the back end to try to improve speed. And once we've got some of that worked out, then we're gonna be in a much better position to start tackling some really challenging user experience kinds of issues. One of the hard things about archaeology, and especially the approach that we have, is that because there are so many different ways of describing these materials, presenting search results and making information easy to retrieve and find is hard. And so we need to put a lot more investment into trying to make the exploration of Open Context much easier than it is right now, because it is challenging because of the diversity of the materials in there. We need to come up with better ways of guiding users to find what they're looking for, and then making sure that, once they've found what they're looking for, they can leave with the dataset that they expected to have, in a format that is going to be useful for them, without too many technical hurdles in the way. And are there any other aspects of Open Context
[00:55:16] Tobias Macey:
Are there any other aspects of Open Context or research data curation and publication that we didn't discuss yet that you'd like to cover before we close out the show?
[00:55:23] Eric Kansa:
Well, I think that, again, one of the weird and harder aspects of this is that it's a really new area of research in a lot of ways. What it really requires are people with technical programming and data skills, so data science, I guess; library skills, thinking about issues of metadata, preservation, copyright, and intellectual property; and then the domain skills, understanding the domain of archaeology and its specific requirements, and why it is that we have a hundred different researchers and two hundred different classification systems for the same set of materials. So there are a lot of these areas of expertise that we have to bring together, and right now we don't really have institutions that are well set up to cope with these new needs. We don't have clear career paths and ways of sustaining all of this. That's one of the big challenges: things are very siloed in universities, where, say, a university librarian does library things and academic researchers publish in conventional journals. We're asking people to come together and bridge several different domains that traditionally haven't worked closely together. That's a real challenge, and it's not just in archaeology; it's one of the fundamental challenges of managing research data across all sorts of different disciplines.
So we definitely need institutional structures that can help organize, facilitate, and sustain that work, and the people who are involved in it. There's a lot of expertise that has to go into this that doesn't fit well within conventional structures. The big challenge moving forward is that a lot of people realize that research data is important, that we have to curate it well, and that we want to preserve it so it can be accessible to people in the future who can say new things with it. So there's all sorts of interest in this, but we still haven't nailed down exactly how institutions are going to make this information broadly accessible and usable for the future. Right now it still feels pretty rickety in terms of the sustaining supports. We want to make sure that this type of thing becomes less weird and much more of a regular part of the way research is conducted in the 21st century.
[00:58:01] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:58:23] Eric Kansa:
Oh, alright. That's a really good question. In our own work, we spend a huge amount of time cleaning up datasets. We use OpenRefine a lot, and we spend more time using OpenRefine than actually munging data within Open Context itself. So many of the problems that we encounter work well with OpenRefine, but there's still that 20% where, essentially, you have to write custom scripts, a little Python, to manage data in a certain way. One of the challenges around all of that is that those custom scripts are checked into source control and everything, but I want that to be something much more reproducible, and people have been working on this in the whole area of reproducible research.
There's been a lot of interesting work using things like Jupyter Notebooks and that type of thing, but it would be really good to have these kinds of preprocessing and data-processing pipelines be much easier to curate and to audit: to be able to see exactly what the inputs were, what you did, and what the outputs came out to be, so that if you have a problem you can trace it, and you can justify what you did along every step of the way, and do that in a way that isn't overly onerous, because you also have to get stuff done. There's a balance between trying to document everything and be reproducible, and also trying to make sure you can make a deadline and get the data out without spending too much time and resources on it. So I think that's an interesting area that could use some tooling. It's a really challenging area, but it's something that I'm really interested in, and if anybody has any suggestions, I'd love to hear them too.
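To make the idea of an auditable cleaning step a bit more concrete, here is a minimal Python sketch of the pattern Eric describes: record what went in, what was done, and what came out, so a preprocessing run can be traced and justified later. This is not Open Context's actual tooling; the file names, the "material" column, and the cleaning rule are hypothetical examples.

```python
"""A minimal sketch of an auditable CSV cleaning step (hypothetical example)."""
import csv
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file so the exact input and output versions are on record."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def clean_rows(rows):
    """Apply a simple, documented cleaning rule and note every change made."""
    notes = []
    for i, row in enumerate(rows, start=2):  # row 1 is the header
        if "material" not in row:
            continue
        # Hypothetical rule: normalize whitespace and case in a 'material' column.
        original = row["material"] or ""
        normalized = " ".join(original.split()).lower()
        if normalized != original:
            notes.append(f"row {i}: material {original!r} -> {normalized!r}")
            row["material"] = normalized
    return rows, notes


def run(in_path: Path, out_path: Path, log_path: Path) -> None:
    with in_path.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        fieldnames = reader.fieldnames or []

    cleaned, notes = clean_rows(rows)

    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(cleaned)

    # Write a small audit record alongside the output: inputs, outputs, changes.
    log_path.write_text(json.dumps({
        "ran_at": datetime.now(timezone.utc).isoformat(),
        "input": {"path": str(in_path), "sha256": sha256_of(in_path)},
        "output": {"path": str(out_path), "sha256": sha256_of(out_path)},
        "changes": notes,
    }, indent=2), encoding="utf-8")


if __name__ == "__main__":
    run(Path("finds_raw.csv"), Path("finds_clean.csv"), Path("finds_clean.audit.json"))
```

The point of the sketch is the audit record, not the cleaning rule itself: checking the JSON log into version control next to the script gives a traceable account of each run without much extra effort.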
[01:00:24] Tobias Macey:
Well, I want to thank you very much for taking the time today to join me and discuss the work that you're doing at Open Context. It's definitely a very interesting project, and it's filling a real and necessary need. So thank you for all of your work on that and for taking the time tonight, and I hope you enjoy the rest of your evening. Alright. Thanks, Tobias.
Introduction to Eric Kansa and Open Context
Challenges in Archaeological Data Management
Mission and Goals of Open Context
Data Formats and Domain Expertise
Transforming and Curating Datasets
Creative Commons Licensing and Ethical Considerations
Citation and Collaboration with Academic Journals
Technical Architecture of Open Context
Ontology and Metadata Management
Static vs. Dynamic Data and Version Control
Dataset Sizes and Long-term Archival
Interesting Uses of Open Context Data
Lessons Learned in Data Curation
Future Goals and Improvements
Challenges in Research Data Management
Closing Remarks and Contact Information