The “Always already computational. Collections as data” paradigm, coined by a research project of the same name, has resonated strongly among libraries, archives and other GLAM institutions worldwide since 2016. Many of them strive to offer access to their data and are experimenting with public APIs, dedicated data labs, data dumps, and even closed computing environments. In parallel, the decidedly computational analysis of cultural heritage data has emerged as a vibrant subfield which has so far produced a dedicated journal, a conference series, workshops and monographs. We characterize the “computational humanities” subfield by its distinct user group: computer-savvy humanists who wish to analyze cultural (heritage) data at scale, harnessing advanced methods from data science and machine learning. This short paper addresses the question of to what extent GLAM institutions succeed in meeting the needs of this research community.

It is motivated by our ongoing work to create a data lab for the impresso project. impresso aims to break down national and institutional data silos, providing unified access to newspaper and radio archives in Western Europe. The forthcoming impresso data lab strives to facilitate access to such a complex, multilingual and multimodal collection, focusing especially on “programming historians” as a distinct user group. The presentation will give an overview of the current state of the art in commercially and publicly funded cultural heritage data labs, and present user requirements and researcher personas as well as transparency requirements. More specifically, this entails:

A survey of data labs for computational humanities research: the first part of our analysis comprises a survey of existing data labs in the fields of cultural heritage and digital humanities. To determine the shape of the impresso data lab, we need to gather ideas and best practices. We investigate how data labs provide access to collections, e.g. 
via APIs, data dumps or other means, what type of information they make available (metadata, text, image), and to what extent these labs manage to integrate heterogeneous data (or provide access in parallel). Besides access, we inspect whether labs provide computational infrastructure to support the analysis and exploration of their data, for example by allowing users to spin up dedicated VMs or, less costly, Google Colab notebooks or Binder environments. We especially focus on the role of notebooks as a bridge between infrastructure and research applications (Melgar-Estrada et al., 2019).

User requirements and researcher personas: having established what exists in terms of data labs, we elicit user requirements from researchers interested in working with historical media archives at scale. Building an infrastructure does not automatically mean that the community will use it (Zundert, 2012). Therefore, in the second part of the presentation, we report on interviews conducted with researchers in the computational humanities. More generally, we will discuss how we envisage creating communities around the tools and models we develop as part of the data lab, and ensuring longer-term support and use (Arnold et al., 2019).

Requirements for transparency and data-criticism: transparency has been a key value of the impresso project since its inception. The quality of digital research depends on being able to understand (and control) the process through which data was collected, processed and analyzed. We assess how a data lab can maximize both transparency and utility (allowing users to look under the hood and be in charge of the research, without this becoming a burden or hindrance). 
We discuss various methods to enhance transparency, for example collecting paradata on the collections by documenting archival knowledge; releasing the models used in processing data; ensuring users can recreate and repurpose data pipelines; and facilitating data-criticism through overviews of both present and missing data (Beelen et al., 2023).

The computational analysis of cultural heritage data has been embraced by both data providers and the digital humanities community. It is, however, not obvious how exactly institutions can effectively and efficiently support research practices. This short presentation will report on the lessons we learned during the survey, interviews, and our own design process.