Use of vocabularies for metadata curation and quality assessment in Social Sciences and Humanities

The event “Use of vocabularies for metadata curation and quality assessment in social sciences and humanities” was organised as part of the Horizon 2020 funded project TRIPLE (Grant Agreement No. 863420, https://cordis.europa.eu/project/id/863420). The starting point for the event was to analyse the challenges faced in creating the multilingual GoTRIPLE vocabulary and to compare it with other initiatives recently undertaken in the context of the Social Sciences and Humanities (SSH) branch of the European Open Science Cloud (EOSC).
Learning outcomes
This event resource targets an audience already knowledgeable in the field. By following the resources, you should be able to:
- better understand the current situation concerning topical vocabularies in the SSH context and the interoperability challenges faced within/by the SSH branch of the EOSC;
- be familiar with some initiatives related to metadata curation and enrichment in the SSH;
- follow best practices when applying a vocabulary to a dataset.
Participants
- Tomasz Umerle (IBL-PAN)
- Klaus Illmayer (ACDH-CH)
- Péter Király (GWDG)
- Ceri Binding (University of South Wales / ARIADNE)
- Julien Homo (FoxCub)
- Alessia Bardi (CNR-ISTI / OpenAIRE)
- Antoine Isaac (Europeana)
- Matej Durco (ACDH-CH / DARIAH)
- Cesare Concordia (CNR-ISTI)
- Erzsébet Tóth-Czifra (DARIAH)
- Laure Barbot (DARIAH)
- Marco Raciti (DARIAH)
- Canan Hastik (DALIA Data Literacy Alliance / NFDI EduTrain / TaDiRaH)
- Nikodem Wołczuk (IBL-PAN)
- Maja Dolinar (ADP / CESSDA)
- Massimiliano Carloni (ACDH-CH)
- Nina C. Rastinger (ACDH-CH)
- Arnaud Gingold (OPERAS)
- Najla Rettberg (RDA)
Organisation Team
Laure Barbot, Matej Durco, Marco Raciti
Photos
Photos of the event can be found here: https://www.flickr.com/photos/142235661@N08/albums/with/72177720307257293
Bibliography
- Daan Broeder, Nikos Vasilogamvrakis, & Iraklis Katsaloulis. (2022, April 28). TRIPLE Open Science Training Series: Multilingual Vocabularies for SSH (20 April 2022). Zenodo. https://doi.org/10.5281/zenodo.6501586
- Frontini, Francesca; Gamba, Federica; Monachini, Monica and Broeder, Daan, 2021, SSHOC Multilingual Data Stewardship Terminology, ILC-CNR for CLARIN-IT, http://hdl.handle.net/20.500.11752/ILC-567
- Frontini, Francesca; Gamba, Federica; Monachini, Monica and Broeder, Daan, 2021, SSHOC Multilingual Metadata, ILC-CNR for CLARIN-IT, http://hdl.handle.net/20.500.11752/ILC-568
- Georgiadis, Haris, Blaszczynska, Marta, & Maryl, Maciej. (2023). TRIPLE Deliverable: D2.4 Report on identification and creation of new vocabularies (DRAFT). Zenodo. https://doi.org/10.5281/zenodo.7539922
- Harpring, P. (2010). Introduction to Controlled Vocabularies: Terminology for Art, Architecture, and Other Cultural Works. Los Angeles: Getty Research Institute. ISBN: 978-1-60606-026-1
- Le Franc, Yann, Parland-von Essen, Jessica, Bonino, Luiz, Lehväslaiho, Heikki, Coen, Gerard, & Staiger, Christine. (2020). D2.2 FAIR Semantics: First recommendations (1.0). FAIRsFAIR. https://doi.org/10.5281/zenodo.5361930
- Lippell, Helen, ed. Taxonomies: Practical Approaches to Developing and Managing Vocabularies for Digital Information. London: Facet Publishing, 2022.
- Pomerantz, Jeffrey. Metadata. Cambridge, MA: MIT Press, 2015, pp. 48-54 («Metadata Gone Wild!»)
- Trupiano, Luca and Concordia, Cesare, 2021, SSHOC Data Stewardship terminology and Metadata SKOSifying mapping, ILC-CNR for CLARIN-IT, http://hdl.handle.net/20.500.11752/ILC-566
- Zaytseva Ksenia and Ďurčo Matej (2020). Controlled Vocabularies and SKOS. Version 1.1.0. Edited by Matej Ďurčo and Tanja Wissik. DARIAH-Campus. [Training module]. https://campus.dariah.eu/id/D8d6OrLdpLlGRqBSQDVN0
Resources
- SSH Vocabulary Commons: https://vocabs.sshopencloud.eu/browse/en/
- Multilingual Vocabularies for SSH Training, a training session available on DARIAH-Campus that focuses on controlled vocabularies for Social Sciences and Humanities (SSH): https://campus.dariah.eu/resource/posts/multilingual-vocabularies-for-ssh-training
- Basic Register of Thesauri, Ontologies & Classifications (BARTOC) https://bartoc.org/
- SSHOC Multilingual Data Stewardship terminology and Multilingual Metadata SKOSifying mapping: https://gitea-s2i2s.isti.cnr.it/concordia/sshoc-skosmapping
- TRIPLE vocabulary: https://www.semantics.gr/authorities/vocabularies/SSH-LCSH/vocabulary-entries
- Perio.do, a public domain gazetteer of scholarly definitions of historical, art-historical, and archaeological periods: https://perio.do/en/
- ARIADNE Vocabulary Matching Tool: https://vmt.ariadne.d4science.org/vmt/vmt-app.html
- SSH Open Marketplace: https://marketplace.sshopencloud.eu/
- OpenAIRE Graph, an open resource that aggregates a collection of research data properties (metadata, links) available within the OpenAIRE Open Science infrastructure to interlink information by using a semantic graph database approach: https://graph.openaire.eu/
- EUROPEANA: https://www.europeana.eu/
- QA catalogue, a metadata quality assessment tool for library catalogue records: https://github.com/pkiraly/metadata-qa-marc
- RDA TIGER, a new project tasked with providing support services for the Working Groups (WGs) within the Research Data Alliance. The idea of starting an RDA Working Group “Multilingual Vocabulary Alignment” was presented during the event: https://www.rd-alliance.org/rda-tiger
- dariah.lab: https://lab.dariah.pl/en/
1.SSH Vocabulary Commons
The SSH Vocabulary Commons brings together experts and managers from the SSH research infrastructures CESSDA, CLARIN, DARIAH and E-RIHS, who have agreed to share their expertise and work towards common recommendations for, firstly, using and managing vocabularies in SSH and, secondly, operating, sharing and managing vocabulary services useful to the broad SSH community.
The SSH Vocabulary Commons work is specifically directed towards:
- Common recommendations for creating, managing and using vocabularies in SSH research and resource management that will make the SSH infrastructures more interoperable and more efficient.
- Aligning current vocabulary management practices with EOSC and FAIR principles, making vocabularies first-class citizens that are easy to find, share and access.
- Improving facilities for multilingual vocabularies.
- Promoting the sharing of vocabularies, management procedures and software between different SSH domains and organizations by recommending specific knowledge organization/representation formats (e.g. SKOS) and guidance on how to apply them (e.g. vocabulary metadata, versioning).
- Providing easier cross-domain integration of (meta)data and semantic interoperability by supporting and providing procedures for vocabulary matching.
- Maintaining contacts with vocabulary software development teams and promoting SSH infrastructure interests.
- Facilitating vocabulary recommendation, helping researchers find relevant vocabularies for their research.
- Providing a default vocabulary hosting and publishing service for orphaned vocabularies.
2.Publishing SSHOC Multilingual Terminologies
The SSHOC Multilingual Terminologies consist of a set of multilingual metadata concepts and an automatically extracted multilingual Data Stewardship terminology. They were created in the context of the SSHOC project as case studies, as part of an investigation into how Language Technologies can help promote and facilitate multilingualism in the Social Sciences and Humanities. The terminologies were created to evaluate the performance of state-of-the-art tools and to derive a set of recommendations on how best to apply them. The talk will present the main recommendations derived from the investigation, together with the workflow and tools adopted to publish the Terminologies as SKOS resources.
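To illustrate what publishing a multilingual term as a SKOS resource can look like, here is a minimal sketch using only the Python standard library. It is not the SSHOC workflow: the base URI, the concept identifier and the labels are hypothetical placeholders, and a real pipeline would more likely rely on a dedicated RDF library.

```python
def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def concept_to_turtle(base_uri: str, concept_id: str, labels: dict) -> str:
    """Serialise one concept with one skos:prefLabel per language tag."""
    if not labels:
        raise ValueError("at least one label is required")
    label_lines = [
        f'    skos:prefLabel "{escape_literal(label)}"@{lang}'
        for lang, label in sorted(labels.items())
    ]
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<{base_uri}{concept_id}> a skos:Concept ;",
        " ;\n".join(label_lines) + " .",
    ]
    return "\n".join(lines)


# Hypothetical concept: the URI and identifier are placeholders for illustration.
turtle = concept_to_turtle(
    "https://example.org/vocab/",
    "metadata",
    {"en": "metadata", "fr": "métadonnées", "de": "Metadaten"},
)
print(turtle)
```

Each language gets its own language-tagged `skos:prefLabel`, which is what lets a consuming service display or search the concept in the user's language.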
3.How to use vocabularies to enrich GoTriple?
The GoTriple platform is a discovery service for SSH publications. Its ingestion pipeline transforms metadata records into the Triple Data Model, then performs a series of cleansing, normalization and enrichment procedures to deal with metadata heterogeneity, increase multilingualism and improve content searchability and discoverability. Finally, it stores and indexes the enriched metadata records, making them searchable via the GoTriple search engine. Many vocabularies are involved in this process, and they play a strategic role in the quality of the results and, by extension, in the satisfaction of the platform’s end users. What are these vocabularies? Where are they used and how? What are the difficulties encountered in the framework of GoTriple? We will try to answer these questions during the session.
4.Using controlled vocabularies to organise keywords in the SSH
Controlled vocabularies prove useful for improving the quality of topical keywords describing SSH publications. This presentation deals with the challenge of improving the quality of keywords that are expressed as plain strings (keyword strings). Experience from the dariah.lab project (lab.dariah.pl) and the TRIPLE project shows that keyword strings applied by institutional actors (e.g. data officers in digital libraries) often originate from controlled vocabularies and can therefore be easily enriched with references to those vocabularies. Keyword strings supplied by authors, on the other hand, originate from such resources less frequently. This presentation outlines automated and semi-automated methods for enriching both types of keyword strings with references to controlled vocabularies.
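A simple automated step of this kind can be sketched as a label lookup after normalisation. The following is an illustrative sketch only, not the method used in dariah.lab or TRIPLE; the vocabulary, its URIs and the sample keywords are hypothetical.

```python
import unicodedata


def normalise(text: str) -> str:
    """Case-fold, strip accents and collapse whitespace so variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.casefold().split())


def build_label_index(vocabulary: dict) -> dict:
    """Map every normalised label (in any language) to its concept URI."""
    index = {}
    for uri, labels in vocabulary.items():
        for label in labels:
            # Keep the first concept seen if two concepts share a label.
            index.setdefault(normalise(label), uri)
    return index


def enrich_keywords(keywords, index):
    """Pair each keyword string with a concept URI, or None when no label matches."""
    return [(kw, index.get(normalise(kw))) for kw in keywords]


# Hypothetical vocabulary: concept URI -> labels in several languages.
VOCABULARY = {
    "https://example.org/vocab/c1": ["Sociology", "Sociologie", "Soziologie"],
    "https://example.org/vocab/c2": ["Literary history", "Histoire littéraire"],
}

index = build_label_index(VOCABULARY)
matches = enrich_keywords(["  sociologie ", "histoire litteraire", "memes"], index)
for keyword, uri in matches:
    print(repr(keyword), "->", uri)
```

Keywords that come back with `None` are the natural input for a semi-automated step, in which a curator reviews candidate concepts instead of relying on exact label matches.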
5.Multilinguality and the use of vocabularies at Europeana
Europeana gathers millions of metadata records from over 3500 libraries, archives and museums across Europe. Such diversity comes with many quality issues (sparseness, heterogeneity, multilinguality). We present how Europeana and its partners work to address these issues, in a collaborative and flexible fashion. The Europeana Data Model paves the way for gathering richer metadata, but all parties need to exploit this potential. In particular, populating the data model’s elements with multilingual linked open data vocabularies can bring a lot of value. We encourage our partners to publish richer metadata that exploits the controlled vocabularies they often already use. We also perform automatic metadata enrichment, using a purpose-built linked-data based “entity collection” as target. Finally, partners have also embarked on producing their own automatic metadata enrichments. In parallel, the Europeana network has developed a tier system for the Europeana Publishing Framework, which enables measuring and reporting on important aspects of metadata quality in the Europeana datasets.
6.ARIADNEplus vocabulary mapping strategy
The ARIADNEplus project aggregated over 3.5 million metadata records describing archaeological sites. These records were converted and integrated into an RDF triple store using a data schema based on the CIDOC Conceptual Reference Model (CRM), then exported to an OpenSearch index for use with the search portal UI. Search works through three major facets: place, time and subject. This presentation primarily concerns the subject integration work, which involved 59 local vocabularies containing over 19,000 subject terms in 16 different languages. A vocabulary matching tool allowed each data provider to define correctly formatted mappings from their own local vocabularies to a central common ‘spine’ vocabulary (Getty Art & Architecture Thesaurus, AAT). Using the resultant mappings, the metadata records were enriched with AAT identifiers. The AAT poly-hierarchical structure was incorporated into the search system, facilitating both multilingual cross search and hierarchical subject expansion. This approach led to a noticeable improvement in both recall and precision.
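The combination of spine mappings and hierarchical subject expansion can be sketched as follows. This is a toy illustration under stated assumptions, not the ARIADNEplus implementation: the `spine:` identifiers, the provider names, the mappings and the hierarchy are all hypothetical and do not correspond to real AAT identifiers.

```python
# Hypothetical mappings: (provider, local term) -> spine concept identifier.
LOCAL_TO_SPINE = {
    ("provider-a", "Grabhügel"): "spine:barrow",
    ("provider-b", "tumulus"): "spine:barrow",
    ("provider-b", "nécropole"): "spine:cemetery",
}

# Hypothetical poly-hierarchy: concept -> broader concepts (several parents allowed).
BROADER = {
    "spine:barrow": ["spine:burial-site", "spine:earthwork"],
    "spine:cemetery": ["spine:burial-site"],
    "spine:burial-site": [],
    "spine:earthwork": [],
}


def ancestors(concept, broader):
    """Return all transitive broader concepts of `concept`."""
    seen = set()
    stack = list(broader.get(concept, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, []))
    return seen


def enrich(record, mappings):
    """Attach spine identifiers for every local subject term that has a mapping."""
    keys = [(record["provider"], term) for term in record["terms"]]
    subjects = {mappings[key] for key in keys if key in mappings}
    return {**record, "subjects": subjects}


def search(records, query_concept, broader):
    """Return ids of records tagged with the query concept or any narrower concept."""
    return [
        record["id"]
        for record in records
        if any(
            subject == query_concept or query_concept in ancestors(subject, broader)
            for subject in record["subjects"]
        )
    ]


RECORDS = [
    {"id": "r1", "provider": "provider-a", "terms": ["Grabhügel"]},
    {"id": "r2", "provider": "provider-b", "terms": ["nécropole"]},
    {"id": "r3", "provider": "provider-b", "terms": ["unmapped term"]},
]
enriched = [enrich(record, LOCAL_TO_SPINE) for record in RECORDS]
burial_hits = search(enriched, "spine:burial-site", BROADER)
earthwork_hits = search(enriched, "spine:earthwork", BROADER)
print(burial_hits, earthwork_hits)
```

Because the German and French terms map to shared spine concepts, a single query retrieves records regardless of the provider's language, and a query on a broader concept also returns records tagged with its narrower concepts.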
7.Vocabularies in the SSH Open Marketplace
The SSH Open Marketplace (https://marketplace.sshopencloud.eu) acts as a place to get information on tools, services, training materials, publications, datasets, and workflows from the Social Sciences and Humanities domain. Users can propose to add new items or to extend the information about existing items. The SSH Open Marketplace data model operates with dynamic fields that can be connected to vocabularies. The presentation gives information about the technical background of the integration of vocabularies and shares experiences with this approach. The vocabularies used by the SSH Open Marketplace are represented in the ‘SSH vocabs commons’ project. Ideas on how to further develop the ‘SSH vocabs commons’ project are discussed from the perspective of the SSH Open Marketplace.
8.Vocabularies in the OpenAIRE Graph
The OpenAIRE Graph is an open collection of metadata about entities of the research life cycle (publications, research data, software, other types of research results, project grants, organisations, data sources) semantically linked with each other. Metadata and links are collected from trusted scholarly communication sources (such as institutional repositories, data archives, publishers’ platforms, Crossref, DOAJ) and enriched by data mining algorithms that detect duplicates and semantic relationships and assign subject classifications. The talk will present how controlled vocabularies are used in the OpenAIRE Graph, focusing on those used for subject classification and how they are leveraged by the OpenAIRE services CONNECT, EXPLORE and MONITOR.
9.Vocabularies and quality improvement of library catalogues
QA catalogue is a metadata quality assessment tool developed for library catalogues based on the MARC21 and PICA metadata schemas. Catalogues use vocabularies internally for recording their own structure and the terms particular data elements might contain, and externally for enhancing the descriptive data elements. This latter category covers two main areas: subject indexing and normalisation of (personal, family, corporate, geographical) names, titles and events. The tool reads MARC or PICA files (in different formats), analyses some quality dimensions, and saves the results into CSV files.
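The read, analyse and save-to-CSV pattern can be illustrated with one simple quality dimension, field completeness. The sketch below is not QA catalogue's code or output format: the records are simplified dictionaries rather than parsed MARC files, and the function names and CSV columns are assumptions made for illustration. The tags 245, 100 and 650 are the MARC 21 fields for title statement, main personal name entry and topical subject.

```python
import csv
import io

# Simplified stand-ins for catalogue records: field tag -> list of values.
RECORDS = [
    {"245": ["A history of maps"], "100": ["Doe, Jane"], "650": ["Cartography"]},
    {"245": ["Untitled pamphlet"], "650": []},
    {"245": ["Field notes"], "100": ["Roe, Richard"]},
]
FIELDS = ["245", "100", "650"]


def completeness(records, fields):
    """Share of records in which each field has at least one non-empty value."""
    total = len(records)
    scores = {}
    for field in fields:
        filled = sum(
            1 for record in records
            if any(value.strip() for value in record.get(field, []))
        )
        scores[field] = filled / total if total else 0.0
    return scores


def to_csv(scores):
    """Write one row per field to a CSV string (a file object would work the same way)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["field", "completeness"])
    for field, score in scores.items():
        writer.writerow([field, f"{score:.2f}"])
    return buffer.getvalue()


scores = completeness(RECORDS, FIELDS)
report = to_csv(scores)
print(report)
```

A low score for a subject field such as 650 indicates how many records lack subject indexing, which is one of the places where external vocabularies enter a catalogue.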
10.Comparing SSH vocabularies and their applications in different systems
Vocabularies are used in a wide range of services. These include long-term digital archives as well as research project overview websites and training platforms. Vocabularies allow depositors and users to describe resources by indicating their most relevant aspects, usually summarised in single keywords. Analysing vocabularies is therefore a productive way of discovering which aspects of a research area are covered by a particular service, and how services relate to each other. This hands-on session presents a practical workflow for comparing vocabularies from different services and identifying possible overlaps. ‘Vocabularies’ is used here in a broad sense, including both controlled vocabularies (ranging from controlled lists to thesauri) and folksonomies, i.e. collections of uncontrolled keywords used in a service. A Jupyter notebook is included, containing the Python code needed to perform the comparison. A set of vocabularies is already provided for comparison (such as the SSH Open Marketplace keywords and Getty AAT). In addition, an accompanying presentation illustrates some preliminary thoughts on vocabulary terminology, usage and interoperability (with a special focus on the users’ perspective) as well as a description of a comparison between vocabularies of the Austrian DH community recently performed at the Austrian Centre for Digital Humanities and Cultural Heritage.
Link to the Jupyter notebook: https://zenodo.org/record/7845914#.ZEEG5HbP0Uo
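For readers who want a feel for this kind of comparison before opening the notebook, the following is an independent minimal sketch and not the notebook's code. It measures overlap between term lists with the Jaccard index after light normalisation; the service names and terms are hypothetical.

```python
from itertools import combinations


def normalise(term: str) -> str:
    """Case-fold and collapse whitespace so trivially different spellings match."""
    return " ".join(term.casefold().split())


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare(vocabularies: dict) -> dict:
    """Return {(name_a, name_b): (shared_terms, jaccard_score)} for every pair."""
    normalised = {
        name: {normalise(term) for term in terms}
        for name, terms in vocabularies.items()
    }
    return {
        (a, b): (
            sorted(normalised[a] & normalised[b]),
            jaccard(normalised[a], normalised[b]),
        )
        for a, b in combinations(sorted(normalised), 2)
    }


# Hypothetical term lists standing in for a folksonomy and a controlled vocabulary.
VOCABULARIES = {
    "service_a_keywords": ["Text Mining", "annotation", "OCR", "maps"],
    "service_b_thesaurus": ["annotation", "text mining", "palaeography"],
}

overlaps = compare(VOCABULARIES)
shared, score = overlaps[("service_a_keywords", "service_b_thesaurus")]
print(shared, round(score, 2))
```

Exact matching on normalised strings underestimates overlap between vocabularies in different languages or with different spelling conventions, so results like these are best read as a lower bound.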