ExploreCor - Using Programmable Corpora in Computational Literary Studies

About the Training School
The academic training school, “ExploreCor: Using Programmable Corpora in Computational Literary Studies,” took place in Vienna in June 2024, spanning three intensive days. The unique programme was designed to provide participants with a comprehensive understanding of the research cycle in Computational Literary Studies, equipping them with the skills needed to navigate the evolving landscape of digital humanities.
The training school began by delving into the critical process of finding and evaluating corpora of literary texts. Participants explored the concept of Programmable Corpora, a pivotal aspect of the curriculum. Programmable Corpora are dynamic collections of literary works that can be manipulated programmatically, allowing for customised and nuanced analyses.
The curriculum progressed to the formulation of research questions and the subsequent execution of analyses using Python and Jupyter Notebooks. Attendees utilised the tool DraCor, designed for efficient and flexible literary text analysis. Additionally, the training incorporated the CLSCor catalogue, a Linked Data-powered resource developed within the CLS INFRA project, enabling students to explore and select corpora for their research.
An integral component of the training focused on APIs and Linked Data, emphasising the interconnected nature of literary datasets. Students engaged in exemplary research projects using the DraCor system, gaining hands-on experience in navigating digital literary networks. The training school addressed the issue of Repeatable Research, a challenge in the Digital Humanities landscape, and explored methods to ensure research replicability.
Digital Literary Network Analysis was a key topic of the programme, providing participants with the tools to uncover intricate relationships within literary texts. A dedicated segment addressed the Reproducibility Crisis in Digital Humanities, underscoring the importance of transparent and replicable research practices.
To ensure the longevity and accessibility of research outcomes, the training school introduced Docker as a valuable tool. Participants learnt how to leverage Docker to encapsulate their research environments, enhancing the reproducibility of their findings.
By combining theoretical foundations with practical applications, participants left equipped to navigate the complexities of programmable corpora, digital literary analysis, and reproducible research, contributing to the ongoing advancement of the field.
Credits
This training event was organised with the considerable effort of many people, many of whom played multiple roles.
Preparatory Information
Software Downloads
The software listed below was used during the more practical parts of this workshop. We invite you to download it if you wish to use this learning resource as a practical guide to the methods and techniques covered.
Gephi
Download and install the latest version of “Gephi” (https://gephi.org)
Docker
Download and install “Docker Desktop” (https://www.docker.com/products/docker-desktop/)
After installing “Docker”, go to https://github.com/dracor-org/dracor-explorecor and follow the instructions in the “Setup of a Local DraCor Environment” section. At the end of this process, a local Jupyter Lab instance should be running at http://localhost:8889.
Further Reading
Börner, I., & Trilcke, P. (2023). CLS INFRA D7.1 On Programmable Corpora (v1.0.0). Zenodo. (https://doi.org/10.5281/zenodo.7664964)
Börner, I., & Trilcke, P. (2024). CLS INFRA D7.3 On Versioning Living and Programmable Corpora (v1.0.0). Zenodo. (https://doi.org/10.5281/zenodo.11081934)
Ďurčo, M., Charvat, V. M., Börner, I., Mrugalski, M., & Odebrecht, C. (2022). CLS INFRA D6.1 Inventory of existing data sources and formats. Zenodo. (https://doi.org/10.5281/zenodo.7520287)
Ďurčo, M., Charvát, V. M., & Resch, S. (2025). CLS INFRA D6.2 Transformation toolbox & ingest and processing workflow. Zenodo. (https://doi.org/10.5281/zenodo.14998374)
Mrugalski, M., Odebrecht, C., Charvat, V., Börner, I., & Durco, M. (2022). CLS INFRA D5.1. Review of the Data Landscape. Zenodo. (https://doi.org/10.5281/zenodo.6861022)
Schöch, C. (2023). Repetitive research: a conceptual space and terminology of replication, reproduction, revision, reanalysis, reinvestigation and reuse in digital humanities. _Int J Digit Humanities_ 5, 373–403. (https://doi.org/10.1007/s42803-023-00073-y)
1. Introduction to Programmable Corpora
For Computational Literary Studies, one research object has proven to be of particular relevance that hardly plays a role in traditional literary studies: the corpus. In this introduction to the “ExploreCor” Training School, we first reflect on working with literary corpora in Computational Literary Studies.
2. Introducing DraCor
In this introduction to the “ExploreCor” Training School, we present DraCor, the Drama Corpora Project, and its digital ecosystem as a prototype of “Programmable Corpora”.
3. Using DraCor: Four Showcases
In this part of the introduction to the “ExploreCor” Training School, we demonstrate how to use the provided dockerised DraCor research environment and the bundled Jupyter Lab instance to do research with the DraCor API.
4. Introduction to Linked Open Data for Beginners
Linked Open Data (LOD) refers to datasets that are publicly available, can be linked to other datasets, and can be interpreted and used not only by humans but also by machines. This presentation outlines the main principles of LOD and its underlying framework, the Semantic Web. Technical aspects are also covered, including how to formalise LOD using the Resource Description Framework (RDF), ontologies, and controlled vocabularies; how to express it using a human-readable syntax (Turtle); and how to search through it using the SPARQL query language.
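The triple-based data model behind RDF can be illustrated with a few lines of plain Python. This is only a toy sketch, not RDF tooling and not material from the session: every `ex:` identifier below is made up, and the hypothetical `match` helper merely mimics how a single SPARQL triple pattern treats unbound positions as variables.

```python
# Toy illustration of the RDF data model: a graph is a set of
# (subject, predicate, object) triples. All "ex:" identifiers are made up.
TRIPLES = {
    ("ex:play1", "rdf:type", "ex:Play"),
    ("ex:play1", "ex:title", "An Example Play"),
    ("ex:play1", "ex:author", "ex:author1"),
    ("ex:play2", "rdf:type", "ex:Play"),
    ("ex:play2", "ex:author", "ex:author1"),
    ("ex:author1", "ex:name", "Example Author"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the sorted triples that fit a pattern.

    A position left as None matches anything, loosely like a SPARQL variable.
    """
    return sorted(
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    )


# Roughly comparable to: SELECT ?play WHERE { ?play ex:author ex:author1 }
plays_by_author = [
    s for (s, _, _) in match(TRIPLES, predicate="ex:author", obj="ex:author1")
]
print(plays_by_author)
```

A real triple store adds what this sketch leaves out: globally unique IRIs, typed literals, and queries that join several patterns at once.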
5. Exploring Programmable Corpora 1: A Case Study in Conducting Research with the DraCor API
A key component of Programmable Corpora is the research-driven API, which makes it particularly easy to retrieve and process corpus data that have been generated for specific research questions. The DraCor API was developed especially for the network analysis method. This session introduces the method of network analysis and presents the DraCor API and its possibilities in detail.
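As a rough illustration of retrieving network data through such an API, the sketch below builds a request URL and parses an edge list using only the Python standard library. It is not taken from the session notebooks. The base URL, the `/corpora/{corpus}/plays/{play}/networkdata/csv` path, and the `Source`, `Target` and `Weight` column names are assumptions about the public DraCor API that should be checked against its documentation. The helper names are hypothetical.

```python
import csv
import io
from urllib.request import urlopen

# Assumption: base URL of the public DraCor API (version 1).
API_BASE = "https://dracor.org/api/v1"


def network_csv_url(corpus, play, base=API_BASE):
    """Build the URL of a play's network data as CSV (assumed path layout)."""
    return f"{base}/corpora/{corpus}/plays/{play}/networkdata/csv"


def parse_edges(csv_text):
    """Parse an edge list with Source, Target and Weight columns (assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["Source"], row["Target"], int(row["Weight"])) for row in reader]


def fetch_edges(corpus, play, base=API_BASE):
    """Download and parse the edge list of one play; needs network access."""
    with urlopen(network_csv_url(corpus, play, base)) as response:
        return parse_edges(response.read().decode("utf-8"))


# Offline demonstration with a made-up edge list in the assumed format.
sample = "Source,Type,Target,Weight\nA,Undirected,B,3\nA,Undirected,C,1\n"
print(parse_edges(sample))
```

With network access, a call such as `fetch_edges("ger", "lessing-emilia-galotti")` would request the data for one play. The `base` argument can instead point to a locally running DraCor instance, which keeps the data stable across runs.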
6. Exploring Programmable Corpora 2: Introducing Network Analysis
Continuing to demonstrate how DraCor can be used for network analysis, the first part of this session is an introduction to dramatic network analysis. It outlines the theoretical background of dramatic network analysis. Second, it presents a range of network measures relevant to the analysis of networks in dramatic literature. Finally, it discusses the potential of dramatic network analysis in the field of literary studies.
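Two common network measures, degree and density, can be computed from an edge list in a few lines. The sketch below is a minimal illustration, not the session's own code: the four-character network is hypothetical, and the functions assume an undirected network without self-loops, in which an edge stands for two characters sharing a scene.

```python
from collections import defaultdict


def degrees(edges):
    """Map each node to its number of distinct neighbours.

    Assumes an undirected network without self-loops.
    """
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    return {node: len(adjacent) for node, adjacent in neighbours.items()}


def density(edges):
    """Share of possible undirected edges that are present: 2m / (n(n-1))."""
    unique_edges = {frozenset(edge) for edge in edges}
    n = len(degrees(edges))
    if n < 2:
        return 0.0
    return 2 * len(unique_edges) / (n * (n - 1))


# Hypothetical network: A shares scenes with B, C and D; B also with C.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
deg = degrees(edges)
print(deg)
print(density(edges))
```

In this example the character A has the highest degree, and four of the six possible edges are present. Dedicated tools such as Gephi compute these and many further measures directly.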
7. Reproducibility 1: Replication or Prediction or What?
As a result of the so-called “reproducibility crisis”, making research repeatable has become a crucial topic in empirical and technical sciences. In Computational Literary Studies (CLS), there is still a shortage of both a culture of repetitive research and user-friendly technical solutions. In this talk, Christof Schöch introduces a theoretical framework to describe modes of repeating research in Digital Humanities.
8. Reproducibility 2: Reproducible Research with DraCor
This session demonstrates how to use the available Docker images of DraCor together with GitHub to set up stable local DraCor corpora that allow research to be replicated. After viewing this session, learners should be able to create their own custom corpora and should know strategies for sharing them.