Summary
Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data that they host, the challenges that they face with scaling ETL processes which require domain specific knowledge, and how the information contained in connections that they expose is being used for interesting projects.
Introduction
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Eric Kansa about Open Context, a platform for publishing, managing, and sharing research data
Interview
- Introduction
- How did you get involved in the area of data management?
I did some database and GIS work for my dissertation in archaeology back in the late 1990s. I got frustrated at the lack of comparative data, and at all the work I put into creating data that nobody would likely use, so I decided to focus my energies on research data management.
- Can you start by describing what Open Context is and how it started?
Open Context is an open access data publishing service for archaeology. It started because we need better ways of disseminating structured data and digital media than is possible with conventional articles, books, and reports.
- What are your protocols for determining which data sets you will work with?
Datasets need to come from research projects that meet the normal standards of professional conduct (laws, ethics, professional norms) articulated by archaeology’s professional societies.
- What are some of the challenges unique to research data?
- What are some of the unique requirements for processing, publishing, and archiving research data?
You have to work on a shoestring budget, essentially providing "public goods". Archaeologists typically don’t have much discretionary money available, and publishing and archiving data are not yet very common practices.
Another issue is that it will take a long time to publish enough data to power "meta-analyses" that draw upon many datasets. Lots of archaeological data describe very particular places and times. Because datasets can be so particularistic, finding data relevant to your interests can be hard. So we face a monumental task in supplying enough data to satisfy many, many particularistic interests.
- How much education is necessary around your content licensing for researchers who are interested in publishing their data with you?
We require use of Creative Commons licenses, and strongly encourage the CC-BY license or CC0 (public domain) to keep things simple and easy to understand.
- Can you describe the system architecture that you use for Open Context?
Open Context is a Django (Python) application with a PostgreSQL database and an Apache Solr index. It runs on Google Cloud services on Debian Linux.
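Apache Solr is used for things like faceted search over the published records. As a rough illustration of the kind of query involved, here is a minimal pysolr sketch; the core name and facet field names are placeholders, not Open Context's actual schema.

```python
# A minimal sketch of a faceted Solr query of the kind used for browse/search.
# The host, core name, and field names are placeholders.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/open_context", timeout=10)

results = solr.search(
    "pottery",                     # free-text query
    **{
        "facet": "on",
        "facet.field": "item_type",  # hypothetical facet field
        "rows": 10,
    },
)

print("matches:", results.hits)
print("facets:", results.facets.get("facet_fields", {}))
```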
- What is the process for cleaning and formatting the data that you host?
- How much domain expertise is necessary to ensure proper conversion of the source data?
That’s one of the bottlenecks. We have to do an ETL (extract, transform, load) process on each dataset researchers submit for publication. Each dataset may need lots of cleaning and back-and-forth conversations with the data creators.
- Can you discuss the challenges that you face in maintaining a consistent ontology?
- What pieces of metadata do you track for a given data set?
- Can you speak to the average size of data sets that you manage and any approach that you use to optimize for cost of storage and processing capacity?
- Can you walk through the lifecycle of a given data set?
- Data archiving is a complicated and difficult endeavor due to issues pertaining to changing data formats and storage media, as well as repeatability of computing environments to generate and/or process them. Can you discuss the technical and procedural approaches that you take to address those challenges?
- Once the data is stored you expose it for public use via a set of APIs which support linked data. Can you discuss any complexities that arise from needing to identify and expose interrelations between the data sets?
- What are some of the most interesting uses you have seen of the data that is hosted on Open Context?
- What have been some of the most interesting/useful/challenging lessons that you have learned while working on Open Context?
- What are your goals for the future of Open Context?
Contact Info
- @ekansa on Twitter
- ResearchGate
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Open Context
- Bronze Age
- GIS (Geographic Information System)
- Filemaker
- Access Database
- Excel
- Creative Commons
- Open Context On Github
- Django
- PostgreSQL
- Apache Solr
- GeoJSON
- JSON-LD
- RDF
- OCHRE
- SKOS (Simple Knowledge Organization System)
- Django Reversion
- California Digital Library
- Zenodo
- CERN
- Digital Index of North American Archaeology (DINAA)
- Ansible
- Docker
- OpenRefine
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello. Welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them. So check out Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And don't forget to go to dataengineeringpodcast.com/chat to join the community and keep the conversation going. Your host is Tobias Macey. And today I'm interviewing Eric Kansa about Open Context, a platform for publishing, managing, and sharing research data. So, Eric, could you start by introducing yourself? Hi. Yeah. My name's Eric. I direct the Open Context project, a nonprofit
[00:01:15] Unknown:
publishing service for archaeology, for archaeological research data. It started out in 2006 and has gone through several different iterations over the past several years, and we currently have about 1,500,000 entities that we've published. I can talk more about what that means later. Those come from roughly a thousand different researchers and institutions around the world. And it is really intended to be a way of sharing a lot of the structured data that comes out of the world of archaeological excavations and archaeological surveys. And do you remember how you first got involved in the area of data management? Yeah. Mostly, it was born out of frustration. I was doing my graduate work, my dissertation work, in the late 1990s in archaeology. My background was focused on studying the early Bronze Age and the impact of the formation of the Egyptian state and civilization on its neighboring regions. And that work involved building some databases and some GIS.
And I was really interested in seeing how the data that I was creating and exploring would relate to the data that other researchers had also developed. And it was super frustrating because there was really no access to their data. So we would get these, in some ways frustratingly, very superficial publications where they would summarize a few results from excavations that were related to what I was looking at, but I was never able to see the structured databases. What were the actual counts of the different things that they were finding, and how did that relate to what I had? So that was one of the real reasons why I wanted to get into this area: there was a real niche there, a strong need to be able to compare these different datasets and try to see the bigger picture. And decades later, we're still working at it, and it turned out to be a much harder problem than I initially, naively thought it would be. And so along that journey, you ended up
[00:03:27] Unknown:
working on building the Open Context platform and organization. So can you describe a bit about what it is that you're doing at Open Context and the mission that you're trying to drive towards and how the overall project got started?
[00:03:42] Unknown:
Yeah. So we recognized early on that the conventional modes of publishing in the research world, in academia, were just not sufficient for managing structured data. So conventional articles, books, and reports, those are the sort of bread and butter of scholarly communication between researchers, and that's the world of publish or perish. Right? So people are publishing at a feverish pace, but those publications are very difficult to use for any sort of quantified analysis, difficult to aggregate, and there are all sorts of difficult issues with actually reusing the information that they're presenting in these conventional publications. And the other issue with archaeology is that it often relies on destructive methodologies. So when you dig a site, you're actually destroying that site.
And in the process of excavation, you have a really strong professional and ethical imperative to do very detailed recording of what it is that you're encountering when you do that excavation. You have to know where you're finding everything. You have to know the stratigraphic relationships between different deposits. You have to know where different architectural features are. There's a huge amount of recording that has to take place, and that recording is actually quite complicated. And it's typically done using databases of one form or another to actually record this excavation process.
And unless we come up with ways of keeping that information, keeping those records in those databases, then, because that process of excavation destroys sites, all the information that comes out of it will be lost unless we do something to preserve, archive, and share these digital data. So in addition to opening up new research opportunities, which is one of the things that got me excited about this issue of managing data, we have this really important ethical imperative: this is the way that we're going to pass an archaeological record down to future generations, if we are successful in the management of this research data. So those are the driving forces behind this, and Open Context is our attempt to provide practical, real-world services in order to meet those larger needs of dissemination, opening new research opportunities, and also putting these data into formats, into a larger public context, and into digital repositories where they can be preserved. And so the datasets that you're
[00:06:23] Unknown:
managing, are they solely dedicated to archaeological research, or do you have other scientific domains represented as well? The vast majority are archaeological.
[00:06:33] Unknown:
We have one test dataset in public health, but for the most part, we're busy enough with the archaeologists that we're really focused on that need. And the other issue is that our publishing model, the way that we curate the data, really requires some domain expertise. So just because of who we are in terms of our own background and staffing, we do focus on archaeology, not other outside domains where there could be very different kinds of data modeling and metadata and ontology concerns that are beyond our expertise and comfort level. And in terms of determining which datasets
[00:07:14] Unknown:
you are willing and able to work with, do you have a particular set of guidelines or protocols for when you're first interacting with somebody who comes to you with a particular, set of records that they want to publish on your platform?
[00:07:30] Unknown:
Yeah. Absolutely. It's important for us that the research that we publish is coming from projects that are meeting the normal standards of professional conduct in the discipline of archaeology. The professional societies in the field have ethics codes and professional norms, and there are also laws that govern how archaeology is conducted in different jurisdictions around the world. And so all of those different kinds of standards and norms have to be met for us to engage with a researcher to publish a dataset. This is important because those professional conduct frameworks help establish that we are working for the archaeological research community, and we don't wanna provide a platform for people who are doing treasure hunting. There's a world out there of people who are doing things like illicit metal detecting and treasure hunting, and if we were to publish that kind of information, it might endanger sites. They might get vandalized, they might get looted, and we wanna make sure that we're not facilitating the destruction of cultural heritage.
So working with a professional community of people who agree to a common set of ethics, that's something that is,
[00:08:52] Unknown:
an important aspect of, who we work with. And in terms of the actual data that you receive from these different research projects and archaeological excavations, what sorts of data formats are you dealing with and some of the unique challenges that come along with the nature of the data that you're dealing with in terms of how it's obtained, how it's recorded, and how it's structured?
[00:09:17] Unknown:
So most archaeologists are not necessarily experts in databases, and the domain typically involves a lot of use of pretty normal office suite kinds of products. The more sophisticated archaeologists are using relational databases, databases like FileMaker or Access, especially. A lot of people use Excel for recording structured data, and there's a lot of variability in the consistency of the data that people record. So some people do have different kinds of protocols in place for data validation so that the datasets are more consistent, and a lot of people don't. And so there's this issue that some data need a lot of work and after-the-fact cleanup, and that can be pretty labor intensive.
One of the other issues is that archaeologists typically collaborate when they build their datasets. So a single excavation may have different individuals who are documenting, describing archaeological contexts, which are different kinds of deposits of dirt that they dig, and they would have different specialists who would be studying the different classes of materials. So typically there'd be, say, animal bones that are recovered from an archaeological site, and those are described by a zooarchaeologist, somebody who has training in zoology and anatomy.
There could be seeds, charred seed remains, that are being described by a botanist. And there could be different experts who will be studying different aspects of material culture. So pottery, coins, stone tools, metal implements. Some other people might be studying different kinds of artwork or sculptures or all sorts of different kinds of materials, sometimes even textiles. So there's a huge number of people with very different kinds of expertise, and they typically create their own datasets. And the main way that these different datasets can be related to one another is through a shared context: where is it that an object or a bone or a seed was found, which archaeological context was the source of that material. And one of the interesting and challenging bits is that because different researchers are basically managing their own datasets individually, bringing them all together can actually be a lot harder than you think it should be. Because, you know, somebody might write in their Excel spreadsheet that a certain bone comes from a locus, which is a common term for an archaeological deposit, say locus 10, and they write "L.10", and then somebody else in a different database is looking at stone tools from that same locus.
They might write "locus 10" or just "10". So there's this issue of inconsistent identifiers that are used to reference archaeological contexts. And that can actually be a very big headache when you try to relate these different kinds of materials together for just one archaeological site. And so this is why there needs to be a lot of investment in trying to go through and understand the different ways that some of these identifiers are expressed and reconcile them so that you can actually bring the materials together in the way that they should be.
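To make the reconciliation problem concrete, here is a minimal, hypothetical Python sketch of normalizing variant locus identifiers before joining specialists' records. The patterns are invented for illustration and are not Open Context's actual ETL code.

```python
import re
from typing import Optional

# Map variant context identifiers ("L.10", "Locus 10", "l-10", "10") onto one
# canonical form so records from different specialists can be joined.
# This pattern is illustrative only.
LOCUS_PATTERN = re.compile(r"^(?:l(?:ocus)?)?[\s.\-]*0*(\d+)$", re.IGNORECASE)

def canonical_locus(raw: str) -> Optional[str]:
    """Return 'Locus 10' for any recognized variant of locus 10, else None."""
    match = LOCUS_PATTERN.match(raw.strip())
    if not match:
        return None  # flag for manual review and a conversation with the data creator
    return f"Locus {int(match.group(1))}"

# Example: three specialists, three spellings, one archaeological context.
for raw in ["L.10", "locus 10", "10", "l-10"]:
    print(raw, "->", canonical_locus(raw))
```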
[00:12:41] Unknown:
And in the process of onboarding these different datasets from various research activities or dig sites, As you were saying, there's a lot of domain expertise that's required to be able to make an effective use of that source data and convert it into the schema that you've established for OpenContacts. So can you talk through some of your strategies and tactics for ensuring that you are able to make an appropriate set of transformations for these different datasets and try to extract patterns to make the process more repeatable and also any issues that you encounter during that activity as far as mitigating data loss because of those conversion efforts?
[00:13:29] Unknown:
Yeah. I mean, on the aspect of data loss, Open Context has a very, very abstract, generalized global schema. There's not much more to it than sort of a triple store or a key-value pair type of thing describing every entity. And we have some common metadata requirements and some common rules of inference around all that that help make things a bit more discoverable. But, really, a researcher who submits their dataset to us can describe those data with any sorts of attributes that they want. They often have their own controlled vocabulary. Sometimes they're gonna be referencing a shared controlled vocabulary that might be professionally curated. So, especially the people who work with animal bones, pretty much they're classifying the species, the biological taxon of the bone, in very similar kinds of ways, and so that becomes an easier issue in terms of linking across different classification systems.
But for the most part, people come up with their own idiosyncratic ways of describing stuff. And Open Context's main assumption is that stuff is related via contextual relationships, that a certain record will have relationships to other records that we can describe. So we're very flexible, I guess, in terms of the different kinds of schemas that we accept, especially with descriptive attributes. The hard thing really centers on making sure that we're understanding the identifiers correctly in the source dataset.
So context identifiers are the thing that we focus a lot of attention on. And the other thing is just, what is it that is being described? That's one of the things that we care a lot about. So sometimes, depending on the way that somebody structures the dataset, say they're just using a flat table like an Excel spreadsheet, multiple records are describing the same thing. It's just that they wanna add multiple attributes to that one thing. So a coin might have multiple motifs on it, and they would have multiple rows in the spreadsheet to record that there are multiple motifs on a certain coin or a certain potsherd or whatever.
And we need to understand that, okay, this is the same coin, it just has multiple attributes. It's not a mistake that this thing is repeated over and over again. So those are the sorts of questions that we have to look at when we map the schemas from a source dataset and go through our extract, transform, load process and move things into Open Context. Essentially, we have to understand what is being described, whether things are being described by attributes that can take on multiple values or not, and then what the relationships are between the things that are being described.
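A small sketch of the multi-row case described above, with invented column names: repeated spreadsheet rows for the same object are collapsed into one entity whose attributes can hold multiple values.

```python
from collections import defaultdict

# Flat rows exported from a spreadsheet: the same coin appears in several rows
# because it carries two motifs. Column names here are invented for the example.
rows = [
    {"object_id": "coin-001", "attribute": "motif", "value": "eagle"},
    {"object_id": "coin-001", "attribute": "motif", "value": "laurel wreath"},
    {"object_id": "coin-001", "attribute": "material", "value": "bronze"},
]

# Collapse rows into one record per entity, allowing attributes to take multiple values.
entities = defaultdict(lambda: defaultdict(list))
for row in rows:
    entities[row["object_id"]][row["attribute"]].append(row["value"])

# coin-001 is one coin with two motifs, not a duplicated record.
print(dict(entities["coin-001"]))
# {'motif': ['eagle', 'laurel wreath'], 'material': ['bronze']}
```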
[00:16:30] Unknown:
And for the data that ends up in the open context system, you encourage the use of Creative Commons licensing. So I'm wondering if you've had any issues in terms of needing to educate researchers as far as the implications of that licensing or if you've had any pushback from people who initially approach you once you mention those licensing requirements and just your overall considerations of using Creative Commons as the license category for the data that you are hosting on your platform.
[00:17:05] Unknown:
Yeah. The whole issue of intellectual property in this space is a huge area of research, and it touches on a bunch of different considerations. So there's a practical interoperability consideration: if we all use common standard licenses, then the content that we have would be legally interoperable, which is nice, and that's one of the wonderful things about Creative Commons. They're standard licenses, you can have interoperability, and it's all expressed with standard metadata too, so that in a machine-readable way you can know what the copyright status is of some content. So that's great. But one of the big complications is that it's not just interoperability that we need to optimize, especially in a field like archaeology. In archaeology, we're dealing with lots of different communities around the world with different sets of values and different assumptions about what the archaeological past means to them, who owns that archaeological past, who can speak for it. And those kinds of concerns mean that the ethical landscape around using Creative Commons licenses is a lot more complicated. And so we actually say in Open Context that open access and open licensing are wonderful and powerful tools, but they're not universally appropriate. There are definitely going to be circumstances where Open Context is not a good platform for sharing data. If the data have specific kinds of sensitivities, especially, say, if there are indigenous people who regard this information as important for their own heritage, particularly in situations where there's maybe a history of colonialism, then you have to be very careful about license choices and whether or not a platform like Open Context is an appropriate way of disseminating the data. You might have to find some more restrictive mechanisms in order to take a more judicious and situationally aware approach to all that. So it's an interesting kind of an issue. One of the other issues about licenses like Creative Commons is that they're copyright licenses. And in the United States, there's this distinction between facts and expressions. Factual data, things like measurements or the height of Mount Everest, is typically not seen as something that copyright actually covers. So there's an ambiguity about what aspects of a dataset are expressive, and so the domain of copyright, and what aspects of an archaeological dataset are more factual, where copyright probably doesn't apply, no matter what license you put on it. So those are the kinds of issues that we have to walk through with the researcher community and also the professional community, because we wanna make sure that these tools, license interoperability, open data, lead to good outcomes and not harmful outcomes. And so that's why we want people to be thoughtful about how they're applying these kinds of tools. And further along the topic
[00:20:32] Unknown:
of expressiveness versus just factual information is any research articles that might either accompany or reference the data that is stored in Open Context or that's being submitted to Open Context. So I'm wondering, in terms of the metadata that you track for a given dataset, if there is any reference to or content from research articles that might be associated with those datasets, and, as far as any industry journals or publications for archaeology, any relationship that you might have with them, whether it's positive or ambivalent?
[00:21:15] Unknown:
Yeah. Actually, it's mostly been very positive in terms of the conventional publishers that we work with. And there's a couple different issues. One is that citation is the main kind of currency in academic research and promotion, tenure, that type of thing. So we definitely want people to cite datasets, because people put a lot of work into creating those datasets, and then they put a lot of work into annotating and describing them. And we put a lot of work into working with researchers and cleaning them up and hopefully making them more usable by publishing them with Open Context. So that effort, we want to recognize and reward, because it helps motivate people to do the right thing in sharing data.
And if people cite published datasets, then that should start a nice positive feedback loop of people getting that kind of recognition, and the ideal scenario is that researchers will do well by doing right, that by sharing data they will advance their own careers. So in that sense, we participate with a variety of mechanisms in order to try to make citation easier and more meaningful. Open Context uses a service called EZID, which is hosted by the California Digital Library. The California Digital Library is the main institutional digital service, the institutional repository, for the University of California system. So all the different campuses of the UC system are involved with it, and EZID is a service to mint persistent identifiers. In our case, the persistent identifiers that we mint with EZID are DOIs for large aggregations of data. In Open Context, those are called projects. We also use DOIs to identify tables, another large aggregation of data, which would be a CSV dump of maybe thousands of records that we would express.
And then we also use something called ARKs, Archival Resource Keys, which are another persistent identifier. We use those for more granular kinds of content, in order to facilitate citation of something maybe very specific. And this is one of the advantages of Open Context versus a more conventional data repository: we can make very specific entities of interest to an archaeologist directly citable. So an example could be a coin. Somebody might discover a coin. It might be minted maybe in Rome and discovered clear across the empire all the way over in Turkey, and it would be kind of interesting to be able to trace that it really traveled far in antiquity.
And you might wanna reference that specific object, that coin. If we only published or curated datasets as they came to us, in big tables of spreadsheets and big relational databases, that coin would not be something that you could actually cite, because it would be a record in a giant data table, or it might be information scattered across several different data tables. Right? So one Excel spreadsheet might describe the coin as described by the numismatist. Another relational database might describe the coin as it was recovered by somebody creating an object inventory. And another relational database might describe the contextual relationship, the context where that coin was actually discovered.
And when we publish things with Open Context, we're bringing all of that information from all those different source data files together to create a more cohesive picture of what the excavation results are. That means we're pulling the information about that one entity, that one coin, out of several different sources and actually making it very convenient to cite and discuss in publications. And that's one of the reasons why a lot of researchers find our approach interesting: they wanna be able to talk about a coin. They don't necessarily only wanna cite a spreadsheet, right, that could have thousands of other things in it too.
So that's one of the key aspects of all of this. It's not just fitting into that academic cycle of rewards, where citation and attribution are how you advance your career; citation is actually a really important part of how people make sense of things. You want to be able to talk about a specific object or context or cite some other entity of interest. And in order to talk about it, you have to be able to point to it. You have to reference it. And citation plays an important role in that as well.
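For readers curious about the mechanics, a rough sketch of minting an ARK through EZID's HTTP API is below. It follows EZID's documented mint operation (POST to a shoulder with ANVL-formatted metadata over HTTP Basic auth), but the shoulder, credentials, target URL, and metadata values are placeholders, and Open Context's real workflow may differ.

```python
# Rough sketch of minting a persistent identifier (an ARK) via the EZID HTTP API.
# Shoulder, credentials, and metadata are placeholders; consult the EZID API docs
# and your repository agreement before using anything like this for real.
import requests

EZID_MINT_URL = "https://ezid.cdlib.org/shoulder/ark:/99999/fk4"  # public test shoulder

def to_anvl(metadata: dict) -> str:
    """Serialize metadata into the ANVL 'key: value' format EZID expects."""
    return "\n".join(f"{key}: {value}" for key, value in metadata.items())

metadata = {
    "_target": "https://opencontext.org/subjects/EXAMPLE-UUID",  # placeholder landing page
    "erc.who": "Example Excavation Project",
    "erc.what": "Bronze coin, object record",
    "erc.when": "2019",
}

response = requests.post(
    EZID_MINT_URL,
    data=to_anvl(metadata).encode("utf-8"),
    headers={"Content-Type": "text/plain; charset=UTF-8"},
    auth=("apitest", "apitest"),  # placeholder credentials
)
response.raise_for_status()
print(response.text)  # e.g. "success: ark:/99999/fk4..."
```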
[00:26:23] Unknown:
And in terms of the technical architecture that underpins the open context platform and the ability to manage these entity aggregations and expose the data for these direct citations and for being able to discover interrelations between the data. Can you just discuss the overall technical platform that you've built and, some of the challenges and architectural evolutions that have occurred along the way?
[00:26:55] Unknown:
Well, first of all, the entire stack, everything that we use, is fully open source. The current iteration of Open Context is a Python 3 application using the Django framework, and we have a Postgres relational database as our primary data store in the back end, and we use an Apache Solr index. That's basically a big document NoSQL database for things like faceted search. Not a lot of Open Context is really meant for data analysis or visualization itself. What we're mainly trying to do is make the data that we curate browsable and discoverable.
You can actually look at it, and hopefully it's somewhat aesthetically pleasing, at least. That also matters because it's publishing, and the aesthetic element is an issue. And then once it's browsable and discoverable and all that, we wanna make it relatively easy to grab the data that you want, and then you can use your own tools that you're comfortable with to do your own analysis and visualization. So mostly, for the community of people that we serve, that would be like browsing around, seeing, here's a set of pottery from this one archaeological site, and then you narrow it down. Okay, I'm just selecting the certain time period of my interest, and then you can export a CSV table of that and walk away and play with it in Excel, something like that. More sophisticated users can use our API to do more sophisticated kinds of things. The API provides most all of the data in GeoJSON, which is a very popular geospatial data format. And you can also usually interpret the data that we have as JSON-LD and then have that converted into a graph of RDF triples.
So there are different options for using it, but, basically, we're mainly aiming to make the data easier to discover so that you can grab it and use it in the tools that you're interested in. When we initially started, we were working in PHP and had a MySQL back end. And I guess I started working in Python probably 2013. That's when we made the switch, 2013, 2014.
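For readers who want to try the API route described here, the sketch below fetches a single published record as JSON and loads it into an RDF graph with rdflib, which bundles a JSON-LD parser in version 6 and later. The URL pattern and UUID are placeholders; consult the Open Context API documentation for the real endpoints.

```python
# Sketch of pulling one published record as JSON-LD / GeoJSON and loading it as RDF.
# The record URL is illustrative; check the Open Context API docs for real endpoints.
# Requires: requests, rdflib>=6 (which includes the JSON-LD parser).
import requests
from rdflib import Graph

record_url = "https://opencontext.org/subjects/EXAMPLE-UUID.json"  # placeholder UUID
payload = requests.get(record_url, headers={"Accept": "application/json"})
payload.raise_for_status()

# The same document can serve as GeoJSON (geometry/features) and JSON-LD (linked data),
# so it can be dropped into a GIS tool or parsed into a triple graph.
graph = Graph()
graph.parse(data=payload.text, format="json-ld")

print(f"{len(graph)} triples")
for subject, predicate, obj in list(graph)[:5]:
    print(subject, predicate, obj)
```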
[00:29:22] Unknown:
And in the process of that switch going from MySQL to Postgres and PHP to Django, was there a high level of difficulty in terms of being able to translate the data that you had present on the platform at the time to the new system, or was it a fairly clean mapping where because it was simply just different relational databases, you were able to largely do an export and restore just as a set of SQL statements? Yeah. Actually,
[00:29:50] Unknown:
weirdly, we went through XML as an intermediary. So the PHP MySQL version of Open Context, instead of having mainly a GeoJSON structured data format that you could use as a view for all the different items, it was an XML structured data format. And this goes back to some of the intellectual pedigree of Open Context. There's a system at the University of Chicago called OCHRE, the Online Cultural Heritage Research Environment, and the initial incarnation of OCHRE was, and still is, a native XML database, and they had a schema that they publicized called Archaeological Markup Language, ArchaeoML. We were essentially using ArchaeoML as our main kind of schema in our own back end, a simplified set of the ArchaeoML schema, actually.
And so, when we did the migration from the PHP MySQL system into our current iteration, we used those XML documents as the source data.
[00:30:56] Unknown:
And as far as the ontology that you maintain and the taxonomical details of the datasets, is there a lot of difficulty as far as being able to map the different entities to those sets of attributes in order to make them more easily linkable so that you can traverse any relations between them. And then in terms of the metadata, I'm curious if you track the dig or the specific research project that the different entities are derived from so that you can have sort of a meta level of linkage. So where you track the different digs in relation to each other, and then from there, dig into the separate entities that are discovered?
[00:31:41] Unknown:
Yeah. So every single project, as I mentioned, has its own descriptive attributes, and even within a project there'd be a lot of diversity. Different researchers who are working on different classes of materials, like the pottery specialists, the bone specialists, the stone tool specialists, would come up with their own set of descriptive attributes to describe the materials that they're looking at, and they could have their own controlled vocabularies, which in archaeology are usually called typologies, sets of measurements that they're interested in, etcetera, etcetera. And that is, unfortunately, the state of the discipline: there's not a huge amount of consensus between researchers describing similar kinds of stuff.
So that diversity is one of the reasons why we have the architecture that we have, in that we have to publish the ways in which archaeologists are describing their materials. We have to publish all of these different kinds of attributes as data also, and that's part of the intellectual contribution that these researchers are creating, because they're usually encountering stuff that very few other people have encountered, and they are making it up as they go along. This is an active area of research where people are trying to find new and better ways of describing archaeological deposits or archaeological material culture, that sort of thing. So they have a lot of intellectual investment, basically, in creating and defining their own descriptive attributes.
So when we publish datasets, we're also publishing a custom set of attributes associated with those datasets. And in order to achieve some interoperability where we can, we use SKOS, the Simple Knowledge Organization System, to say that this one term in this one researcher's dataset, let's say a controlled vocabulary term, might be a close match to this concept in the Getty Art and Architecture Thesaurus. And that would be a common standard that would be widely applicable in cultural heritage. So a good example might be a type of Etruscan pottery called bucchero. Bucchero is this really pretty, dark gray, shiny, heavily burnished kind of pottery.
And it might be described in a couple different ways in different researchers' datasets, but there are some museums, and the Getty Art and Architecture Thesaurus, that have their own controlled vocabulary that describes that. The Getty Art and Architecture Thesaurus is a nice, openly licensed, linked data kind of thesaurus, and then we can say this term bucchero is a close match for how the Getty defines bucchero. And then we can essentially make those common linkages across these idiosyncratic kinds of descriptions to point to some common standards. And so that is more or less the level of semantic interoperability, I guess, that you get in Open Context: we basically start drawing relationships between different vocabulary terms. So it's not that sophisticated.
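To picture what that kind of linking looks like as data, here is a small rdflib sketch asserting a skos:closeMatch between a project's own "bucchero" term and a Getty AAT concept. Both URIs, including the AAT numeric ID, are placeholders rather than the real identifiers.

```python
# Sketch of asserting a SKOS closeMatch between a project's own vocabulary term and
# a Getty AAT concept. The URIs are placeholders (the AAT numeric ID here is made up).
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import SKOS

project_term = URIRef("https://opencontext.org/predicates/EXAMPLE-ware-bucchero")
aat_concept = URIRef("http://vocab.getty.edu/aat/300000000")  # placeholder AAT ID

graph = Graph()
graph.bind("skos", SKOS)
graph.add((project_term, SKOS.prefLabel, Literal("Bucchero", lang="en")))
graph.add((project_term, SKOS.closeMatch, aat_concept))

print(graph.serialize(format="turtle"))
```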
We're not in the business of supporting inferences on OWL ontologies or anything like that. That would be an awesome thing to do more of in the future, especially if we get to collaborate with some people with those sorts of interests and expertise. But what we do get is the ability to combine different datasets, and we've been most successful with zooarchaeology, with animal bones, where we've basically seen that kind of technique work. You can actually do searches now and discover all the cattle bones, all the tibiae of cattle, from dozens of different archaeological projects. And that is a pretty big accomplishment in a field like archaeology where everybody's doing their own thing. And once the data is written into
[00:35:57] Unknown:
open context, is it largely static at that point, or do you have people who will periodically go back through and add additional information or metadata or linkages after the fact? And is that something that users of the platform are able to contribute as suggestions or enhancements to different datasets?
[00:36:19] Unknown:
That would be really interesting to be able to support. No, we haven't built out that sort of functionality of user-contributed semantic enhancement, or even just, hey, there's a bug here, an error. One of the things that we've been experimenting with, and are gonna be moving ahead with in the near term, is version control of data. For the most part, most of the data that we publish with Open Context, once it's published, is pretty static. The main changes are usually going to be additions where, say, we might publish a set of contexts first, and then the pottery person is ready with their pottery data and we add pottery to that context dataset, etcetera, etcetera. So most of it's additions, but sometimes there are gonna be error corrections and that type of thing.
And because we generate all these JSON-LD documents, we have actually used Git as a version control mechanism for all of that. We're doing more of that internally because it was starting to get unwieldy; we're just building these giant, bloated Git repositories internally, and we're trying to figure out a good way of expressing publicly what those changes are. So that's actually an area of active development, but we do have a sort of an audit trail of how things are changing internally.
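As a generic illustration of that idea (not Open Context's internal versioning code), the sketch below writes a record's JSON-LD document into a local Git repository and commits each change, so corrections show up as an inspectable audit trail in the Git history.

```python
# Sketch of keeping an audit trail for published JSON-LD documents with plain Git.
# Generic illustration only; assumes `git init data-audit` has already been run.
import json
import subprocess
from pathlib import Path

REPO_DIR = Path("data-audit")          # a local Git repository (placeholder path)
RECORDS_DIR = REPO_DIR / "records"

def commit_record(record_id: str, document: dict, message: str) -> None:
    """Write a record's JSON-LD document and commit the change."""
    RECORDS_DIR.mkdir(parents=True, exist_ok=True)
    path = RECORDS_DIR / f"{record_id}.jsonld"
    path.write_text(json.dumps(document, indent=2, sort_keys=True), encoding="utf-8")
    relative = str(path.relative_to(REPO_DIR))
    subprocess.run(["git", "-C", str(REPO_DIR), "add", relative], check=True)
    subprocess.run(["git", "-C", str(REPO_DIR), "commit", "-m", message], check=True)

# Example: record an error correction so the change is visible in `git log`.
commit_record("coin-001", {"label": "Bronze coin", "motif": ["eagle"]}, "Correct motif list")
```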
[00:37:47] Unknown:
And that brings me into a couple of my next questions. 1 of which is the average size of the typical dataset that you would be processing, just in terms of relative volumes? Are they generally in the megabyte, gigabyte, terabyte scale? And then I know that part of your overall endeavor is also to serve as an archive repository for a lot of this information. So I'd be interested in exploring some of the complications that arise from trying to maintain a long term record of this information, particularly as storage formats
[00:38:28] Unknown:
and physical medium changes over time? Well, I'll just start with the archival issue first. We're not a preservation repository, because that involves a whole host of institutional requirements and workflows and processes that we just don't have the capability or the scale to achieve ourselves. But what we do do is, when we publish data, we put the data that we publish into preservation repositories that are managed by other institutions. So, historically, the main institution that we've worked with, and continue to work with, is the California Digital Library, which is, again, the main repository of the University of California system.
We're also using Zenodo, which is another repository; it's at CERN, the particle physics research lab in Switzerland. And the rationale for using Zenodo is that they have a very nice, convenient API, and they support a lot of metadata using the linked data entities that we find useful. And there's already a lot of convergence on using them by other archaeologists, so the material is there with a good community of people already engaged with Zenodo. So for long-term data preservation, we're not gonna be the end place for it. But what we can do on our end is try to make that data preservation more feasible, more likely to actually work out, in the sense that, by publishing data with Open Context, we're extracting data from a bunch of different source datasets with very different kinds of file formats and whatnot and expressing it as structured data, as JSON and JSON-LD for the most part, and CSV, these nice, simple text formats that lend themselves to preservation.
And then the other thing is that we often get a lot of media associated with the structured data that we publish. So, talking about coins or pots or whatever, one of the nice things about having that level of granularity and citation is that we can also do things like relate specific pictures, or even 3D models, to specific objects or contexts, and we do that also. And when we publish the material in Open Context, we're also depositing a lot of media files, binary media files of one form or another, into digital repositories. So those are the ways in which we work in the archival sense, providing standard metadata and everything that hopefully makes the job of the digital archives easier in order to maintain the data. Right. So, I mean, it varies all over the place. Some datasets are quite small, say a megabyte spreadsheet or something like that, and then they go up from there.
What really bloats or expands the size greatly is if somebody has a lot of images or digital media. 3D files take up a lot of space, obviously. But in certain circumstances, for example, we recently published a set of archival materials describing the Great Sphinx at Giza in Egypt, and even one scanned image could be a gigabyte itself, and we would have something like 30,000 of them. So there are sometimes some large space and storage concerns. And, again, we really couldn't do what we do without that fundamental infrastructure, the services being provided by the digital repositories, the digital libraries, hosting all these things over the long term, because hosting all these things on our own actually starts getting expensive too, with all those files.
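For the Zenodo side of that workflow, a rough sketch of a deposit through Zenodo's REST API is shown below. It follows the documented create, upload, and describe steps, but the token, file, and metadata are placeholders, and this is not Open Context's production archiving code.

```python
# Rough sketch of depositing a published dataset (a CSV plus metadata) into Zenodo
# via its REST API. Token, file, and metadata values are placeholders.
import requests

ZENODO_API = "https://zenodo.org/api/deposit/depositions"
TOKEN = "YOUR-ZENODO-TOKEN"  # placeholder

# 1. Create an empty deposition.
deposition = requests.post(ZENODO_API, params={"access_token": TOKEN}, json={}).json()

# 2. Upload a file into the deposition's bucket.
bucket_url = deposition["links"]["bucket"]
with open("example_dataset.csv", "rb") as handle:
    requests.put(f"{bucket_url}/example_dataset.csv",
                 data=handle, params={"access_token": TOKEN})

# 3. Attach descriptive metadata (kept deliberately simple here).
metadata = {"metadata": {
    "title": "Example archaeological dataset",
    "upload_type": "dataset",
    "description": "CSV export of published excavation records (placeholder).",
    "creators": [{"name": "Example, Researcher"}],
}}
requests.put(f"{ZENODO_API}/{deposition['id']}",
             params={"access_token": TOKEN}, json=metadata)

# 4. Publishing is a separate, deliberate step:
# requests.post(f"{ZENODO_API}/{deposition['id']}/actions/publish",
#               params={"access_token": TOKEN})
```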
[00:42:23] Unknown:
And as you mentioned, you don't do any in house, complicated analyses or visualizations of the data that you are hosting. So I'm wondering what some of the most interesting or unexpected uses
[00:42:40] Unknown:
of your data you have seen out in the wild? Yeah. Probably the most significant use of data for us has been the Digital Index of North American Archaeology; DINAA is the acronym. DINAA is an aggregation of data that's a little bit different for us, in that most of the data in Open Context come to us from academic researchers, while the datasets coming to us in the DINAA project are coming from state government officials. In the United States, there are historical preservation laws at the federal level, and the federal government has delegated the enforcement of those laws to different state governments. And so each state has officials in charge of managing cultural properties, historical resources, within the state borders.
And each state develops information systems, databases basically, to help in that administration. And you can imagine there are about 50 different schemas, datasets, 50 different ways that states have tried to figure out how to manage historical and archaeological sites. And what's interesting about that is that the state governments have the best window onto what is known about the past, in terms of the settlement history of the North American continent since the Pleistocene, because the states have aggregated all of this data that is required by law to be created. And the neat thing about DINAA is that, by bringing these datasets together from several different states, for the first time researchers are able to see a much bigger picture of what human settlement is like across the North American continent over very deep, long periods of time. So you can see what things looked like 10,000 years ago, in terms of what's known about where there are archaeological sites, and compare that to what the situation was like 500 years ago, at the beginning of European colonization.
So that is a really powerful kind of a resource. There are all sorts of really interesting issues with it; for example, these datasets are also not created with any sort of systematic sampling at all. There are all sorts of biases in them, and that's something that I think is a really fascinating area to explore, and it's gonna be a huge challenge for interpretation. But nevertheless, this dataset is about as big as it gets in archaeology for looking at big-picture kinds of questions. And the big research outcome actually hasn't focused so much on where people were living in the past; the big research outcome we've had with the DINAA dataset has been modeling what's gonna happen to these sites in the future.
And most of the coverage of DINAA focuses on the American Midwest and the Southeast, but we have a lot of the Gulf Coast and the Atlantic Seaboard in the DINAA dataset. And this paper that was published in PLOS ONE described what the impact would be with, say, a one meter, two meter, or three meter rise in sea level driven by climate change. And just with a very modest one meter rise in sea level, which a lot of projections are saying is gonna be happening within the next few decades, we're gonna lose 13,000 known archaeological sites. And that had a huge amount of press coverage and media impact, because a lot of people didn't even know that there were 13,000 archaeological sites in North America, let alone just along the coast, that we'd be losing to sea level rise.
So that one made a big impact, pardon the pun. It made a big splash, and it was recently cited in the 2018 National Climate Assessment, which was a major US government report about the challenges of climate change. So that's probably one of the key research outcomes that we've had with Open Context, and then there have been several smaller but still interesting and significant ones. Again, with the zooarchaeological data, people have been able to see patterns in how animal domestication, which happened first in the Near East, in places like Iraq and Syria and southeast Turkey, spread towards Europe along with the economies around animal husbandry. That's something that was facilitated with Open Context, and there's been publication from that. And then the data that we published originally with that has been reused in other publications, which is also interesting. So there's actually reuse of this published data for more than one research purpose. And then there are other applications which are kind of more fun: people using the API to do things like pull in the data for virtual reality or augmented reality kinds of applications.
Those are smaller scale, but what's fun and useful about that is that we really need to build up the technical skills in archaeology: people who have the data skills to engage with a system like Open Context and develop an understanding of how to use a web API and what you can do with it. And so those are the types of things that are maybe, in the longer term, really important, because that starts getting people more engaged with this whole idea of why shared data is important, and what you can do with APIs, whether it's the Open Context API or, oh look, there's data from a completely different source, maybe you combine it with GeoNames or with Wikidata or something else. And then that starts to enable some really interesting kinds of applications.
[00:48:42] Unknown:
And in the process of building and maintaining the Open Context platform and working with researchers and archaeologists, what have been some of the most interesting or useful or challenging lessons that you've learned? One of the biggest things is that it is
[00:49:00] Unknown:
incredibly slow to change a disciplinary culture. Most academic researchers are in the world of publish or perish, and so they have a lot of professional pressure to make publications that they're sure their peers will recognize as something valuable. And that sometimes leads to some risk aversion: publishing data is something that's relatively new, and it's not a tried and true path to tenure. And so that's one of the reasons why adoption has not been as fast as we would want. Fortunately, that situation is changing now that we're starting to see some interesting reuses of data, and people are starting to cite it, and it's appearing in bibliographies of research papers and whatnot.
So things are definitely getting better, but it has taken more than 10 years to get to this point, and you would think that it would have been more of a no-brainer. Other issues that are also really fascinating: there are a bunch of things that get people excited. Some researchers really like seeing their data online. They want to show it off. It's a great way to have sort of an exhibition. Museums and other institutions have beautiful websites that show off their collections, and researchers are happy to use a service like Open Context sometimes so they can participate in that, show off what it is that they've discovered, and also show off, in a lot of ways, that, look, I do really awesome and rigorous recording, look at all this wonderful data that I've got. So that's a positive incentive, and that's something that I think has been underappreciated in the world of research data management. A lot of times there's a discussion of carrots and sticks about this, and there are efforts to make sharing data mandatory, that if you want a grant, you have to share your data. And in a lot of ways, I'm very sympathetic to that. I think that's definitely the right thing to do, but we also need the positive incentives. So citation is obviously a positive incentive, and people coming up with good research outcomes from shared data, that's a good positive incentive. But the other side is just that we want people to feel that publishing and sharing data is something that's recognized and rewarded.
It helps make their data look more attractive. It's more usable. Their own dataset is more usable to them. There are a lot of positive incentives that we try to emphasize with Open Context, because we wanna get past the notion that data sharing is just a minimal checkbox compliance thing, where I threw a couple ugly spreadsheets into a repository and I'm done. Right? That's not very meaningful. The data are probably not that useful if they're just chucked, left as is, in Excel with different kinds of color codes and random typos and all sorts of problems in them. It's a lot better if people put in that intellectual investment in actually trying to understand how to make their data usable and intelligible by a wider community. And that's where I think there needs to be much more emphasis, and that's why this field is actually really interesting. It's hard to communicate meaning with structured data, especially in a field like this where we have practitioners who are working on very particular kinds of things. There's not a tradition of this already established, and we're trying to build that tradition kind of from scratch. And
[00:52:39] Unknown:
what do you have in mind as far as future improvements or future additions or overall goals
[00:52:47] Unknown:
for the future of Open Context? Well, one of the big things right now is that we're really excited because we're working with another developer who's deep in the guts of the system right now, Raymond Yee. And Raymond has a great background in research computing. He's worked with Unglue.it, he's done some really wonderful work, and he's a fantastic Python developer. And so one of the first things that we're working on is making it easier to deploy Open Context, so that it's much more of a one-line, command-line type of a thing where you can just set the whole thing up, using technologies like Ansible and Docker, so that you can get an instance going, maybe with some data, and work with it on your own, play with it, and hopefully that'll encourage more experimentation and maybe even more people contributing to the open source project, which would be cool. So there's that. Then we have a lot of needs in terms of scaling; we're bumping up into some scaling issues. Things are starting to slow down as our index has grown, and so sharding it is gonna be an important next step. There are lots of optimization things to do on the back end to try to improve speed. And once we've got some of that worked out, then we're gonna be in a much better position to start tackling some really challenging user experience kinds of issues. One of the hard things about archaeology, and especially the approach that we have, is that because there are so many different ways of describing these materials, presenting search results and making information easy to retrieve and find is hard. And so we need to put a lot more investment into trying to make the exploration of Open Context much easier than it is right now, because it is challenging because of the diversity of the materials in there. We need to come up with better ways of guiding users to find what they're looking for, and then making sure that, once they've found what they're looking for, they can leave with the dataset that they expected to have, in a format that is going to be useful for them, without too many technical hurdles in the way. And are there any other aspects of Open Context
[00:55:16] Tobias Macey:
Are there any other aspects of Open Context or research data curation and publication that we didn't discuss yet that you'd like to cover before we close out the show?
[00:55:23] Eric Kansa:
Well, I think that, again, one of the weird and harder aspects of this is that it's a really new area of research in a lot of ways. What it really requires are people with technical programming and data skills, so data science, I guess; library skills, thinking about issues of metadata, preservation, copyright, and intellectual property; and then the domain skills, understanding the domain of archaeology and its specific requirements, and why it is that we have a hundred different researchers and two hundred different classification systems for the same set of materials. So there are a lot of these areas of expertise that we have to bring together, and right now we don't really have institutions that are well set up to cope with these new needs. We don't have clear career paths and ways of sustaining all of this. That's one of the big challenges: things are very siloed in universities, where, say, a university librarian does library things and academic researchers publish in conventional journals. We're asking people to come together and bridge several different domains that traditionally haven't worked closely together. That's a real challenge, and it's not just in archaeology; it's one of the fundamental challenges of managing research data across all sorts of different disciplines.
So we definitely need institutional structures that can help organize, facilitate, and sustain that work, and the people who are involved in it. There's a lot of expertise that has to go into this that doesn't fit well within conventional structures. The big challenge moving forward is that a lot of people realize that research data is important, that we have to curate it well, and that we want to preserve it so it can be accessible to people in the future who can say new things with it. So there's all sorts of interest in this, but we still haven't nailed down exactly how institutions are going to make this information broadly accessible and usable for the future. Right now it still feels pretty rickety in terms of the sustaining supports. We want to make sure that this type of thing becomes less weird and much more of a regular part of the way research is conducted in the 21st century.
[00:58:01] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:58:23] Eric Kansa:
Oh, alright. That's a really good question. In our own work, we spend a huge amount of time cleaning up datasets. We use OpenRefine a lot, and we spend more time using OpenRefine than actually munging data within Open Context itself. So many of the problems that we encounter work well with OpenRefine, but there's still that 20% where, essentially, you have to write custom scripts, a little Python, to manage data in a certain way. One of the challenges around all of that is that those custom scripts are checked into source control and everything, but I want that to be something much more reproducible, and people have been working on this in the whole area of reproducible research.
There's been a lot of interesting work using things like Jupyter Notebooks and that type of thing, but it would be really good to have these kinds of preprocessing and data-processing pipelines be much easier to curate and to audit: to be able to see exactly what the inputs were, what you did, and what the outputs came out to be, so that if you have a problem you can trace it, and you can justify what you did along every step of the way, and do that in a way that isn't overly onerous, because you also have to get stuff done. There's a balance between trying to document everything and be reproducible, and also trying to make sure you can make a deadline and get the data out without spending too much time and resources on it. So I think that's an interesting area that could use some tooling. It's a really challenging area, but it's something that I'm really interested in, and if anybody has any suggestions, I'd love to hear them too.
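To make the idea of an auditable cleaning step a bit more concrete, here is a minimal Python sketch of the pattern Eric describes: record what went in, what was done, and what came out, so a preprocessing run can be traced and justified later. This is not Open Context's actual tooling; the file names, the "material" column, and the cleaning rule are hypothetical examples.

```python
"""A minimal sketch of an auditable CSV cleaning step (hypothetical example)."""
import csv
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file so the exact input and output versions are on record."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def clean_rows(rows):
    """Apply a simple, documented cleaning rule and note every change made."""
    notes = []
    for i, row in enumerate(rows, start=2):  # row 1 is the header
        if "material" not in row:
            continue
        # Hypothetical rule: normalize whitespace and case in a 'material' column.
        original = row["material"] or ""
        normalized = " ".join(original.split()).lower()
        if normalized != original:
            notes.append(f"row {i}: material {original!r} -> {normalized!r}")
            row["material"] = normalized
    return rows, notes


def run(in_path: Path, out_path: Path, log_path: Path) -> None:
    with in_path.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        fieldnames = reader.fieldnames or []

    cleaned, notes = clean_rows(rows)

    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(cleaned)

    # Write a small audit record alongside the output: inputs, outputs, changes.
    log_path.write_text(json.dumps({
        "ran_at": datetime.now(timezone.utc).isoformat(),
        "input": {"path": str(in_path), "sha256": sha256_of(in_path)},
        "output": {"path": str(out_path), "sha256": sha256_of(out_path)},
        "changes": notes,
    }, indent=2), encoding="utf-8")


if __name__ == "__main__":
    run(Path("finds_raw.csv"), Path("finds_clean.csv"), Path("finds_clean.audit.json"))
```

The point of the sketch is the audit record, not the cleaning rule itself: checking the JSON log into version control next to the script gives a traceable account of each run without much extra effort.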
[01:00:24] Tobias Macey:
Well, I want to thank you very much for taking the time today to join me and discuss the work that you're doing at Open Context. It's definitely a very interesting project, and it's filling a real and necessary need. So thank you for all of your work on that and for taking the time tonight, and I hope you enjoy the rest of your evening. Alright. Thanks, Tobias.
Introduction to Eric Kansa and Open Context
Challenges in Archaeological Data Management
Mission and Goals of Open Context
Data Formats and Domain Expertise
Transforming and Curating Datasets
Creative Commons Licensing and Ethical Considerations
Citation and Collaboration with Academic Journals
Technical Architecture of Open Context
Ontology and Metadata Management
Static vs. Dynamic Data and Version Control
Dataset Sizes and Long-term Archival
Interesting Uses of Open Context Data
Lessons Learned in Data Curation
Future Goals and Improvements
Challenges in Research Data Management
Closing Remarks and Contact Information