Use of vocabularies for metadata curation and quality assessment in Social Sciences and Humanities

Location

DARIAH Coordination Office at Centre Marc Bloch, Berlin

Date

27–28 March 2023

Organized by

DARIAH-EU

The event “Use of vocabularies for metadata curation and quality assessment in social sciences and humanities” was organised as part of the Horizon 2020 funded project TRIPLE (Grant Agreement No. 863420, https://cordis.europa.eu/project/id/863420). The starting point for the event was to analyse the challenges encountered in creating the multilingual GoTRIPLE vocabulary and to compare it with other initiatives recently undertaken in the Social Sciences and Humanities (SSH) branch of the European Open Science Cloud (EOSC).

Learning outcomes

This event resource targets an audience already knowledgeable in the field. By following the resources, you should be able to:

have a better understanding of the current situation concerning topical vocabularies in the SSH context and of the interoperability challenges faced within and by the SSH branch of the EOSC;
be familiar with some initiatives related to metadata curation and enrichment in the SSH;
apply best-practice solutions when using a vocabulary to describe a dataset.

Participants

Tomasz Umerle (IBL-PAN)
Klaus Illmayer (ACDH-CH)
Péter Király (GWDG)
Ceri Binding (University of South Wales / ARIADNE)
Julien Homo (FoxCub)
Alessia Bardi (CNR-ISTI / OpenAIRE)
Antoine Isaac (Europeana)
Matej Ďurčo (ACDH-CH / DARIAH)
Cesare Concordia (CNR-ISTI)
Erzsébet Tóth-Czifra (DARIAH)
Laure Barbot (DARIAH)
Marco Raciti (DARIAH)
Canan Hastik (DALIA Data Literacy Alliance / NFDI EduTrain / TaDiRaH)
Nikodem Wołczuk (IBL-PAN)
Maja Dolinar (ADP / CESSDA)
Massimiliano Carloni (ACDH-CH)
Nina C. Rastinger (ACDH-CH)
Arnaud Gingold (OPERAS)
Najla Rettberg (RDA)

Organisation Team

Laure Barbot, Matej Ďurčo, Marco Raciti

Photos

Photos of the event can be found here: https://www.flickr.com/photos/142235661@N08/albums/with/72177720307257293

Bibliography

Daan Broeder, Nikos Vasilogamvrakis, & Iraklis Katsaloulis. (2022, April 28). TRIPLE Open Science Training Series: Multilingual Vocabularies for SSH (20 April 2022). Zenodo. https://doi.org/10.5281/zenodo.6501586
Frontini, Francesca; Gamba, Federica; Monachini, Monica and Broeder, Daan, 2021, SSHOC Multilingual Data Stewardship Terminology, ILC-CNR for CLARIN-IT, http://hdl.handle.net/20.500.11752/ILC-567
Frontini, Francesca; Gamba, Federica; Monachini, Monica and Broeder, Daan, 2021, SSHOC Multilingual Metadata, ILC-CNR for CLARIN-IT, http://hdl.handle.net/20.500.11752/ILC-568
Georgiadis, Haris, Blaszczynska, Marta, & Maryl, Maciej. (2023). TRIPLE Deliverable: D2.4 Report on identification and creation of new vocabularies (DRAFT). Zenodo. https://doi.org/10.5281/zenodo.7539922
Harpring, P. (2010). Introduction to Controlled Vocabularies: Terminology for Art, Architecture, and Other Cultural Works. Los Angeles: Getty Research Institute. ISBN: 978-1-60606-026-1
Le Franc, Yann, Parland-von Essen, Jessica, Bonino, Luiz, Lehväslaiho, Heikki, Coen, Gerard, & Staiger, Christine. (2020). D2.2 FAIR Semantics: First recommendations (1.0). FAIRsFAIR. https://doi.org/10.5281/zenodo.5361930
Lippell, Helen, ed. Taxonomies: Practical Approaches to Developing and Managing Vocabularies for Digital Information. London: Facet Publishing, 2022.
Pomerantz, Jeffrey. Metadata. Cambridge, MA: MIT Press, 2015, pp. 48-54 («Metadata Gone Wild!»)
Trupiano, Luca and Concordia, Cesare, 2021, SSHOC Data Stewardship terminology and Metadata SKOSifying mapping, ILC-CNR for CLARIN-IT, http://hdl.handle.net/20.500.11752/ILC-566
Zaytseva Ksenia and Ďurčo Matej (2020). Controlled Vocabularies and SKOS. Version 1.1.0. Edited by Matej Ďurčo and Tanja Wissik. DARIAH-Campus. [Training module]. https://campus.dariah.eu/id/D8d6OrLdpLlGRqBSQDVN0

Resources

SSH Vocabulary Commons: https://vocabs.sshopencloud.eu/browse/en/
Multilingual Vocabularies for SSH Training, this training session, available on DARIAH-Campus, focuses on controlled vocabularies for Social Sciences and Humanities (SSH): https://campus.dariah.eu/resource/posts/multilingual-vocabularies-for-ssh-training
Basic Register of Thesauri, Ontologies & Classifications (BARTOC) https://bartoc.org/
SSHOC Multilingual Data Stewardship terminology and Multilingual Metadata SKOSifying mapping: https://gitea-s2i2s.isti.cnr.it/concordia/sshoc-skosmapping
TRIPLE vocabulary: https://www.semantics.gr/authorities/vocabularies/SSH-LCSH/vocabulary-entries
Perio.do, a public domain gazetteer of scholarly definitions of historical, art-historical, and archaeological periods: https://perio.do/en/
ARIADNE Vocabulary Matching Tool: https://vmt.ariadne.d4science.org/vmt/vmt-app.html
SSH Open Marketplace: https://marketplace.sshopencloud.eu/
OpenAIRE Graph, an open resource that aggregates a collection of research data properties (metadata, links) available within the OpenAIRE Open Science infrastructure to interlink information by using a semantic graph database approach: https://graph.openaire.eu/
EUROPEANA: https://www.europeana.eu/
QA catalogue, a metadata quality assessment tool for library catalogue records: https://github.com/pkiraly/metadata-qa-marc
RDA TIGER, a new project tasked with providing support services for the Working Groups (WGs) within the Research Data Alliance. The idea of starting an RDA Working Group “Multilingual Vocabulary Alignment” was presented during the event: https://www.rd-alliance.org/rda-tiger
dariah.lab: https://lab.dariah.pl/en/

1.SSH Vocabulary Commons
The SSH Vocabulary Commons brings together experts and managers from the SSH research infrastructures CESSDA, CLARIN, DARIAH and E-RIHS, who agree to share their expertise and work towards common recommendations for, firstly, using and managing vocabularies in SSH and, secondly, operating, sharing and managing vocabulary services useful to the broad SSH community.

The SSH Vocabulary Commons work is specifically directed towards:
- Common recommendations for creating, managing and using vocabularies in SSH research and resource management, making the SSH infrastructures more interoperable and more efficient.
- Aligning current vocabulary management practices with EOSC and FAIR principles, making vocabularies first-class citizens that are easy to find, share and access.
- Improving facilities for multilingual vocabularies.
- Promoting the sharing of vocabularies, management procedures and software between different SSH domains and organizations by recommending specific Knowledge Organization/Representation language formats (e.g. SKOS) and how to apply them (e.g. vocabulary metadata, versioning).
- Providing easier cross-domain integration of (meta)data and semantic interoperability by supporting and providing procedures for vocabulary matching.
- Maintaining contacts with vocabulary software development teams and promoting SSH infrastructure interests.
- Facilitating vocabulary recommendation, helping researchers find relevant vocabularies for their research.
- Providing a default vocabulary hosting and publishing service for orphaned vocabularies.
Speakers
Daan Broeder
Daan Broeder has a background in electrical engineering and signal analysis and a long career in research infrastructures, having worked in different capacities at several CLARIN centres. He managed tasks in European and national projects, including the archiving and metadata work of the TLA unit at the MPI for Psycholinguistics, where he was CTO, and more broadly in the CLARIN, DASISH, EUDAT and PARTHENOS projects. He was the responsible convenor for ISO standards on metadata and persistent identifiers. He was one of the technical coordinators of the Dutch CLARIN and CLARIAH projects and currently participates in the EOSC Future and FAIRCORE4EOSC projects for CLARIN ERIC.
Matej Ďurčo
Matej Ďurčo is Chief Technology Officer for DARIAH-EU. He is also head of the technical group “Digital Humanities Research & Infrastructure” at the Austrian Centre for Digital Humanities and Cultural Heritage at the Austrian Academy of Sciences and one of the key figures in founding this institute. Since 2009, he has also participated actively in the Austrian research infrastructures core group and supported the build-up of the pan-European research infrastructures CLARIN and DARIAH, both on the Austrian and European levels. Besides building technical infrastructures, his main concern has been the social aspect, sharing knowledge and acting as interpreter between humanities research and the technical world ensuring mutual understanding in order to find optimal solutions for the specific needs of the researchers.
2.Publishing SSHOC Multilingual Terminologies
The SSHOC Multilingual Terminologies consist of a set of multilingual metadata concepts and an automatically extracted multilingual Data Stewardship terminology. They were created as case studies in the SSHOC project, as part of an investigation into how Language Technologies can help promote and facilitate multilingualism in the Social Sciences and Humanities. The terminologies were created to evaluate the performance of state-of-the-art tools and to derive a set of recommendations on how best to apply them. The talk presents the main recommendations derived from the investigation, together with the workflow and tools adopted to publish the Terminologies as SKOS resources.
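As a rough illustration of what publishing a multilingual term as a SKOS resource involves, the following sketch serialises one concept with a preferred label per language to Turtle. The base URI, the concept identifier and the sample labels are invented for the example and are not taken from the SSHOC terminologies; the workflow and tools actually used in SSHOC are those presented in the talk, not this code.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


def escape_literal(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def concept_to_turtle(base_uri, concept_id, pref_labels, scheme_id="scheme"):
    """Serialise one concept with one preferred label per language tag."""
    lines = [
        f"<{base_uri}{concept_id}> a skos:Concept ;",
        f"    skos:inScheme <{base_uri}{scheme_id}> ;",
    ]
    labels = [
        f'    skos:prefLabel "{escape_literal(label)}"@{lang}'
        for lang, label in sorted(pref_labels.items())
    ]
    # Predicates are separated by ";" and the last one closes with "."
    lines.append(" ;\n".join(labels) + " .")
    return "\n".join(lines)


# Hypothetical sample data, for illustration only.
BASE = "https://example.org/terminology/"
sample_labels = {
    "en": "data steward",
    "fr": "gestionnaire de données",
    "it": "responsabile dei dati",
}

turtle = SKOS_PREFIX + "\n\n" + concept_to_turtle(BASE, "c1", sample_labels)
print(turtle)
```

Keeping one `skos:prefLabel` per language tag follows the SKOS convention that a concept has at most one preferred label in any given language; alternative forms would go into `skos:altLabel`.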
Speaker
Cesare Concordia
Cesare Concordia is a researcher in Computer Science at the ISTI-CNR institute in Pisa, Italy, where he has been working on topics related to distributed Information Systems and Digital Libraries. He is a member of the AI for Media and Humanities (AIMH) laboratory of ISTI. His research interests also include semi-structured databases, Service Oriented Architectures (SOA) and Semantic Web frameworks. He has been involved in a number of projects, both national and EU funded.
3.How to use vocabularies to enrich GoTriple?
The GoTriple platform is a discovery service for SSH publications. During ingestion, the pipeline transforms metadata records into the Triple Data Model and performs a series of cleansing, normalization and enrichment procedures in order to deal with metadata heterogeneity, increase multilingualism and improve content searchability and discoverability. Finally, it stores and indexes the enriched metadata records, making them searchable via the GoTriple search engine. Many vocabularies are involved in this process; they play a strategic role in the quality of the results and, by extension, in the satisfaction of the platform’s end users. What are these vocabularies? Where are they used and how? What difficulties have been encountered in the framework of GoTriple? We will try to answer these questions during the session.
Speaker
Julien Homo
Julien Homo is co-founder of the Foxcub company in France and a Data Architect with over 12 years of experience designing, developing, integrating and supporting large projects and solutions, building complex architectures and developing business around Data & Analytics. He holds a Master’s degree in Artificial Intelligence and specialises in Semantic Web and Data & AI platforms. He has deployed more than 25 data solutions and services in different sectors, including many projects for the public sector in France and for the CNRS, such as Navigae, Isidore and Matilda. In particular, he supports public institutes in developing enrichment services and enhancing their data using recent and innovative technologies and techniques (natural language processing, machine and deep learning, cognitive search, graph exploration), from specification until they are put into production, industrialised and scaled up. As part of GoTriple, Foxcub contributed to the development of automatic discipline classification and content annotation (tagging) services.
4.Using controlled vocabularies to organise keywords in the SSH
Controlled vocabularies prove useful for improving the quality of topical keywords describing SSH publications. This presentation deals with the challenge of improving the quality of keywords expressed as plain strings (keywords-strings). Experience from the dariah.lab project (lab.dariah.pl) and the TRIPLE project shows that, when keywords-strings are applied by institutional actors (e.g. data officers in digital libraries), they often originate from controlled vocabularies and can therefore be easily enriched with references to those vocabularies. Keywords-strings supplied by authors, on the other hand, originate from such resources less frequently. This presentation outlines automated and semi-automated methods for enriching both types of keywords-strings with references to controlled vocabularies.
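The general idea of linking keywords-strings to a controlled vocabulary can be sketched as follows. This is only an illustrative assumption about how such matching might work, not the method used in dariah.lab or TRIPLE: labels are normalised, an exact lookup is tried first, and a fuzzy match from the standard library serves as a fallback. The vocabulary URIs, labels and the `cutoff` value are hypothetical.

```python
import difflib
import unicodedata


def normalise(label):
    """Lowercase, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", label)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def build_index(vocabulary):
    """Map each normalised label to the URI of its concept."""
    return {
        normalise(label): uri
        for uri, labels in vocabulary.items()
        for label in labels
    }


def match_keyword(keyword, index, cutoff=0.8):
    """Return (uri, method) for a keyword string, or (None, 'unmatched')."""
    key = normalise(keyword)
    if key in index:
        return index[key], "exact"
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    if close:
        return index[close[0]], "fuzzy"
    return None, "unmatched"


# Hypothetical vocabulary: concept URI -> labels in several languages.
VOCAB = {
    "https://example.org/vocab/literary-criticism": ["Literary criticism", "Critique littéraire"],
    "https://example.org/vocab/bibliography": ["Bibliography", "Bibliografia"],
}
index = build_index(VOCAB)

for kw in ["literary  criticism", "Bibliografía", "bibliographies", "quantum gravity"]:
    print(kw, "->", match_keyword(kw, index))
```

In a semi-automated setting, exact matches could be accepted directly, while fuzzy candidates and unmatched strings would be queued for review by a curator.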
Speaker
Tomasz Umerle
Tomasz Umerle is deputy director of the Department of Current Bibliography at the Institute of Literary Research of the Polish Academy of Sciences (IBL-PAN). In the Centre he is responsible for R&D activities and for coordinating the development of the research infrastructure. He is currently engaged in the DARIAH-PL project as coordinator of the Laboratory for supervised semantic discovery, and he chairs the “Bibliographical Data” Working Group at the DARIAH-ERIC consortium. He is also active in the OPERAS RI community, most recently in the TRIPLE and CRAFT-OA projects. His interests include cultural and scientific metadata, mostly bibliographical, and the documentation of literary culture.
5.Multilinguality and the use of vocabularies at Europeana
Europeana gathers millions of metadata records from over 3500 libraries, archives and museums across Europe. Such diversity comes with many quality issues (sparseness, heterogeneity, multilinguality). We present how Europeana and its partners work to address these issues in a collaborative and flexible fashion. The Europeana Data Model paves the way for gathering richer metadata, but all parties need to exploit this potential. In particular, populating the data model’s elements with multilingual linked open data vocabularies can bring a lot of value. We encourage our partners to publish richer metadata that exploits the controlled vocabularies they often already use. We also perform automatic metadata enrichment, using a purpose-built, linked-data based “entity collection” as target. Finally, partners have also embarked on producing their own automatic metadata enrichments. In parallel, the Europeana network has developed a tier system for the Europeana Publishing Framework, which enables measuring and reporting on important aspects of metadata quality in the Europeana datasets.
Speaker
Antoine Isaac
Antoine Isaac is the R&D Manager for Europeana Foundation. He has been researching and promoting the use of Semantic Web and Linked Data technology in culture since his PhD studies at Paris-Sorbonne and the Institut National de l’Audiovisuel. He has especially worked on the representation and interoperability of collections and their vocabularies. He has served in other related W3C efforts, for example on SKOS, Library Linked Data, Data on the Web Best Practices, Data Exchange. He co-chairs the Technical Working Group of the RightsStatements.org initiative and the Discovery Technical Specification Group at the International Image Interoperability Framework (IIIF).
6.ARIADNEplus vocabulary mapping strategy
The ARIADNEplus project aggregated over 3.5 million metadata records describing archaeological sites. These records were converted and integrated into an RDF triple store using a data schema based on the CIDOC Conceptual Reference Model (CRM), then exported to an OpenSearch index for use with the search portal UI. Search is via three major facets: place, time and subject. This presentation primarily concerns the subject integration work, which involved 59 local vocabularies containing over 19,000 subject terms in 16 different languages. A vocabulary matching tool allowed each data provider to define correctly formatted mappings from their own local vocabularies to a central common ‘spine’ vocabulary (Getty Art & Architecture Thesaurus, AAT). Using the resultant mappings, the metadata records were enriched with AAT identifiers. The AAT poly-hierarchical structure was incorporated into the search system, facilitating both multilingual cross search and hierarchical subject expansion. This approach led to a noticeable improvement in both recall and precision.
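The mapping and expansion steps described above can be illustrated with a small sketch. It is not the ARIADNEplus implementation: the provider names, local terms, `spine:` identifiers and hierarchy below are invented stand-ins (real AAT identifiers are deliberately not used), and the data lives in plain Python dictionaries rather than a triple store or search index.

```python
# Hypothetical provider mappings: (provider, local term) -> spine concept.
MAPPINGS = {
    ("provider-a", "hillfort"): "spine:hillforts",
    ("provider-b", "castelliere"): "spine:hillforts",
    ("provider-b", "villa romana"): "spine:villas",
}

# Hypothetical poly-hierarchy: concept -> its broader concepts.
BROADER = {
    "spine:hillforts": ["spine:fortifications", "spine:settlements"],
    "spine:villas": ["spine:settlements"],
    "spine:fortifications": ["spine:built-works"],
    "spine:settlements": ["spine:built-works"],
}


def enrich(record):
    """Add spine identifiers for every local term that has a mapping."""
    subjects = {
        MAPPINGS[(record["provider"], term)]
        for term in record["terms"]
        if (record["provider"], term) in MAPPINGS
    }
    return {**record, "subjects": sorted(subjects)}


def with_narrower(concept):
    """Return the concept plus all concepts below it in the hierarchy."""
    children = {}
    for child, parents in BROADER.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)
    found, stack = {concept}, [concept]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def search(records, concept):
    """Find records indexed with the concept or any narrower concept."""
    wanted = with_narrower(concept)
    return [rec["id"] for rec in records if wanted.intersection(rec["subjects"])]


RECORDS = [
    {"id": "r1", "provider": "provider-a", "terms": ["hillfort"]},
    {"id": "r2", "provider": "provider-b", "terms": ["castelliere"]},
    {"id": "r3", "provider": "provider-b", "terms": ["villa romana"]},
    {"id": "r4", "provider": "provider-a", "terms": ["unmapped term"]},
]
enriched = [enrich(rec) for rec in RECORDS]

print(search(enriched, "spine:fortifications"))
print(search(enriched, "spine:settlements"))
```

Because records from different providers end up sharing the same spine identifier, a query on a broader concept retrieves them regardless of the language of the original local term, which is the effect the abstract describes as multilingual cross search with hierarchical expansion.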
Speaker
Ceri Binding
Ceri Binding has been a researcher in the Hypermedia Research Group at University of South Wales since 2007, having previously worked in civil/structural engineering and then software development. During that time he has jointly published several research papers focussed on controlled vocabularies, data integration & interoperability, Linked Open Data and the semantic web. He developed the ‘heritage data’ UK platform making national cultural heritage controlled vocabularies available as Linked Open Data. He produced the SKOS RDF conversion for the Integrative Levels Classification (2nd Edition, ILC2). He created and maintains an open archive of Networked Knowledge Organisation Systems (NKOS) workshop proceedings. Recent international research projects include the ARIADNEplus H2020 project and its predecessor ARIADNE. His research interests include knowledge organisation, controlled vocabularies and semantic web technologies.
7.Vocabularies in the SSH Open Marketplace
The SSH Open Marketplace (https://marketplace.sshopencloud.eu) is a place to get information on tools, services, training materials, publications, datasets and workflows from the Social Sciences and Humanities domain. Users can propose new items or extend the information about existing items. The SSH Open Marketplace data model operates with dynamic fields that can be connected to vocabularies. The presentation gives information about the technical background of the integration of vocabularies and shares experiences with this approach. The vocabularies used by the SSH Open Marketplace are represented in the ‘SSH vocabs commons’ project. Ideas on how to further develop the ‘SSH vocabs commons’ project are discussed from the perspective of the SSH Open Marketplace.
Speaker
Klaus Illmayer
Klaus Illmayer is a Research Software Engineer at the Austrian Centre for Digital Humanities and Cultural Heritage (ACDH-CH) at the Austrian Academy of Sciences. He holds a PhD in Theatre, Film, and Media Studies from the University of Vienna. His research interests focus on the digital documentation and archiving of cultural expressions (especially theatrical events) as well as data modelling and establishing digital workflows for research. At ACDH-CH he was involved in digital infrastructure projects such as PARTHENOS and SSHOC and is currently working in EOSC Future.
8.Vocabularies in the OpenAIRE Graph
The OpenAIRE Graph is an open collection of metadata about entities of the research life cycle (publication, research data, software, other types of research results, project grants, organisations, data sources) semantically linked with each other. Metadata and links are collected from trusted scholarly communication sources (like institutional repositories, data archives, publishers’ platforms, Crossref, DOAJ) and enriched thanks to data mining algorithms that detect duplicates, semantic relationships and subject classification. The talk will present how controlled vocabularies are used in the OpenAIRE Graph, focusing on those used for subject classification and how they are leveraged by the OpenAIRE services CONNECT, EXPLORE and MONITOR.
Speaker
Alessia Bardi
Alessia Bardi is a Researcher in computer science at the Institute of Information Science and Technologies of the Italian National Research Council with a PhD in Information Engineering. Her research activities focus on infrastructures for scholarly communication, Open Science, metadata aggregation and interoperability. She is actively working on OpenAIRE as product manager of OpenAIRE CONNECT, which offers customizable discovery portals for research communities, and as member of the technical team that delivers the OpenAIRE Graph.
9.Vocabularies and quality improvement of library catalogues
QA catalogue is a metadata quality assessment tool developed for library catalogues based on the MARC21 and PICA metadata schemas. Catalogues use vocabularies internally, for recording their own structure and the terms particular data elements might contain, and externally, for enhancing the descriptive data elements. This latter category covers two main areas: subject indexing and the normalisation of names (personal, family, corporate, geographical), titles and events. The tool reads MARC or PICA files (in different formats), analyses some quality dimensions, and saves the results into CSV files.
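To give a feel for the read, analyse and save-to-CSV pattern, here is a minimal sketch that measures one simple quality dimension, field completeness, over a handful of records. It does not use QA catalogue or its API; the records are hypothetical dictionaries keyed by MARC 21 tags (245 title statement, 100 main entry personal name, 650 topical subject), and the choice of fields is an assumption made for the example.

```python
import csv
import io

# MARC 21 tags checked in this example: title, personal name, topical subject.
FIELDS = ["245", "100", "650"]


def completeness(records, fields):
    """Count, per field, how many records contain at least one value."""
    total = len(records)
    rows = []
    for tag in fields:
        present = sum(1 for rec in records if rec.get(tag))
        ratio = round(present / total, 2) if total else 0.0
        rows.append({"field": tag, "records_with_field": present, "ratio": ratio})
    return rows


def to_csv(rows):
    """Serialise the measurements as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["field", "records_with_field", "ratio"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical, already-parsed records: tag -> list of values.
SAMPLE_RECORDS = [
    {"245": ["Title A"], "100": ["Author A"], "650": ["History"]},
    {"245": ["Title B"], "650": []},
    {"245": ["Title C"], "100": ["Author C"]},
]

report = completeness(SAMPLE_RECORDS, FIELDS)
csv_text = to_csv(report)
print(csv_text)
```

A low ratio for a subject field such as 650 would point to records lacking subject indexing, which is one of the places where external vocabularies come into play.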
Speaker
Péter Király
Péter Király is a software developer and researcher at GWDG, the data centre for the Max Planck Society and the University of Göttingen. He received a PhD from the University of Göttingen in comparative cultural studies. His main research interests are the quality assessment of cultural heritage metadata and cultural analytics, that is, the analysis of these metadata as historical sources. He is an editor of Code4Lib Journal, a member of different library and digital humanities working groups, and a maker and supporter of open source and open data projects. He collaborates with several cultural heritage organisations internationally.
10.Comparing SSH vocabularies and their applications in different systems
Vocabularies are used in a wide range of services. These include long-term digital archives as well as research project overview websites and training platforms. Vocabularies allow depositors and users to describe resources by indicating their most relevant aspects, usually summarised in single keywords. Analysing vocabularies is therefore a productive way of discovering which aspects of a research area are covered by a particular service, and how services relate to each other. This hands-on session presents a practical workflow for comparing vocabularies from different services and identifying possible overlaps. ‘Vocabularies’ is used here in a broad sense, including both controlled vocabularies (ranging from controlled lists to thesauri) and folksonomies, i.e. collections of uncontrolled keywords used in a service. A Jupyter notebook is included, containing the Python code needed to perform the comparison. A set of vocabularies is already provided for comparison (such as the SSH Open Marketplace keywords and Getty AAT). In addition, an accompanying presentation offers some preliminary thoughts on vocabulary terminology, usage and interoperability (with a special focus on the users’ perspective), as well as a description of a comparison between vocabularies of the Austrian DH community recently performed at the Austrian Centre for Digital Humanities and Cultural Heritage.

Link to the Jupyter notebook: https://zenodo.org/record/7845914#.ZEEG5HbP0Uo
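For readers who want a feel for this kind of comparison before opening the notebook, the sketch below compares label sets pairwise and reports shared labels and Jaccard overlap. It is an independent, simplified illustration and does not reproduce the notebook’s code; the service names and keywords are made up.

```python
from itertools import combinations


def normalise(label):
    """Case-fold a label and collapse internal whitespace."""
    return " ".join(label.casefold().split())


def label_set(labels):
    """Build the set of normalised, non-empty labels of one vocabulary."""
    return {normalise(label) for label in labels if label.strip()}


def compare(vocabularies):
    """Return (name_a, name_b, shared labels, Jaccard overlap) per pair."""
    sets = {name: label_set(labels) for name, labels in vocabularies.items()}
    results = []
    for a, b in combinations(sorted(sets), 2):
        shared = sets[a] & sets[b]
        union = sets[a] | sets[b]
        jaccard = round(len(shared) / len(union), 2) if union else 0.0
        results.append((a, b, sorted(shared), jaccard))
    return results


# Hypothetical keyword lists from three services.
VOCABULARIES = {
    "service_a": ["Text Mining", "Annotation", "Linked Open Data"],
    "service_b": ["annotation", "text  mining", "Archaeology"],
    "service_c": ["Archaeology", "Ceramics"],
}

overlaps = compare(VOCABULARIES)
for name_a, name_b, shared, score in overlaps:
    print(f"{name_a} vs {name_b}: {score} {shared}")
```

Exact matching on normalised strings only finds labels that are spelled the same way; synonyms, translations and hierarchical relations would need mappings between the vocabularies rather than string comparison.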
Speakers
Nina C. Rastinger
Nina C. Rastinger is a researcher at the Austrian Centre for Digital Humanities and Cultural Heritage of the Austrian Academy of Sciences with a background in German Philology and Psychology. She is currently part of the project ‘DiTAH’ and leads the City of Vienna funded project ‘Visiting Vienna’ in the context of her dissertation on lists in historical newspapers. In the past, she has participated in various research and infrastructure projects at the ACDH-CH, e.g. the projects ‘TIME MACHINE VIENNA’ and ‘RepoLandscape’. Her current areas of interest include early modern periodicals, digital research workflows and multimodal corpus linguistics.
Massimiliano Carloni
Massimiliano Carloni is a data modeller and repository manager at the Austrian Centre for Digital Humanities and Cultural Heritage (ACDH-CH). His main interests lie in long-term digital preservation and semantic technologies, with a special focus on graph-based data models and Linked Open Data. He is part of the managing team behind ARCHE, the archive for digital research data, and has contributed to the creation of the new library catalogue of the Austrian Academy of Sciences. He is currently working in the project ATRIUM, which aims to facilitate digital methods and improve data and service interoperability in the Arts and Humanities. He is the main person responsible for the Vocabs service at ACDH-CH and DARIAH-EU.
Matej Ďurčo
Matej Ďurčo is Chief Technology Officer for DARIAH-EU. He is also head of the technical group “Digital Humanities Research & Infrastructure” at the Austrian Centre for Digital Humanities and Cultural Heritage at the Austrian Academy of Sciences and one of the key figures in founding this institute. Since 2009, he has also participated actively in the Austrian research infrastructures core group and supported the build-up of the pan-European research infrastructures CLARIN and DARIAH, both on the Austrian and European levels. Besides building technical infrastructures, his main concern has been the social aspect, sharing knowledge and acting as interpreter between humanities research and the technical world ensuring mutual understanding in order to find optimal solutions for the specific needs of the researchers.
