Digging for Gold - Knowledge Extraction from Text

Location

Madrid, Spain

Date

9–11 May 2023

About the Local Host - UNED

UNED, the National University of Distance Education, is a distance-learning university and the largest university in Spain by number of enrolled students. Its mission is to provide higher education as a public service through distance education. To fulfil this mission, it facilitates access to university education and continuity of studies for everyone qualified to pursue higher studies who chooses the UNED educational system for its methodology or for work, economic, residence or any other reasons.

UNED is the public university with the largest number of students in the country, more than 250,000. It has more than 40 years of experience making the principle of equal opportunities in access to higher education a reality, thanks to a methodology based on the principles of distance learning and focused on the needs of the student. It is a leader in applying cutting-edge technologies to learning, with the largest offer of virtual courses in the country. It has a very wide training offer:

27 degrees adapted to the European Higher Education Area (EHEA)
65 official university Master’s degrees adapted to the EHEA.
18 PhD programmes adapted to the EHEA.
4 diplomas and 5 technical engineering degrees (all being phased out).
18 degrees and 3 engineering degrees (all being phased out).
University Access Courses for over 25s and over 45s (CAD).
University Access Tests for over 25s and over 45s (CAD).
Access to the University by professional or work experience (for people over 40 years of age)
13 languages are offered at the Distance Learning University Centre (CUID) – foreign languages, Spanish for foreigners and co-official languages.
Permanent Training (more than 600 courses).
University extension courses (more than 350 courses).
Summer courses (more than 140 courses).

UNED Senior is designed for people over 55 years of age, regardless of their academic training, with the aim of expanding knowledge on current issues.

UNED Open is a new channel created to facilitate the search for open educational content from UNED.

UNED has more than 1,400 professors in 9 Faculties, 2 Technical Schools and the CAD at the Headquarters, and more than 6,500 tutor professors distributed across the Associated Centres.

It is present in all the Autonomous Communities through 61 Associated Centres, more than 100 extensions and classrooms, and 28 zone centres in the Community of Madrid, where tutorials are given and the face-to-face tests prepared by the teaching teams are held.

UNED has a presence in 11 countries in Europe, America and Africa through 12 UNED Centres abroad and 4 concerted centres for exams. It is especially concerned with responding to the training demands of groups with special needs (people with disabilities, immigrants, reintegration of the prison population).

Session 1 Python

To get the most out of this session you will need to have a few things downloaded to your machine.

Anaconda: install Anaconda locally (it includes Jupyter Notebook). Download it from https://www.anaconda.com/

Google Colaboratory (Colab): you will use Colab to access files in your browser: https://colab.research.google.com/notebooks/intro.ipynb?hl=en

GitHub repository: download the files from https://github.com/sros-UNED/NLPforHumanist (Code -> Download ZIP).

Session 5-8 Preparation Notes

The most important technical requirement for Day 2 is installing R and, preferably, RStudio (a development environment for R that will make things easier). We will use the stylo library for most of the day; see the step-by-step introduction to stylo for beginners, or the more extensive HOW TO.

If you don’t or can’t install R and stylo locally, you will be able to run the analysis from Colaboratory notebooks; just be aware that this requires more coding, not less, because stylo’s graphical interface does not work in Colab. You are free to bring your own collection of texts to the workshop, but you can also find more ready-to-use plain-text fiction collections in various languages on the Computational Stylistics website.

Materials

The GDrive folder (https://drive.google.com/drive/folders/1V8KFkDxGpKvYqjbkymqukH1QdjwTJeak?usp=sharing) holds all the necessary materials.

Your options are:

Download it and work locally.
Download the folder and re-upload it to work in Colab’s virtual machine with R (or just open the .ipynb notebook and choose File -> Save a copy in Drive, which creates a copy of the notebook on your own Drive).

To copy the folder to your own GDrive:

Download CLS_Madrid_Folder
Extract the downloaded .zip
Upload the extracted folder back to your GDrive

1.Digging for Gold I: Introduction to Python
This lesson provides an introduction to variables, operators, loops, lists, and dictionaries. Students will learn how to use lists and dictionaries to manipulate data in their programs. By the end of the lesson, students will have a foundational understanding of Python’s core concepts.
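
As a rough illustration of the concepts listed above, the sketch below counts words in a made-up sentence using a variable, a list, a loop, a dictionary and a few operators. The sentence and all names are invented for this example and are not taken from the course notebooks.

```python
# A sample sentence stored in a variable (invented for illustration).
sentence = "the gold is in the text and the text is in the corpus"

# A list holds the words in their original order.
words = sentence.split()

# A dictionary maps each word to the number of times it occurs.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Operators: len() gives the list size, "/" divides two numbers.
total_words = len(words)
share_of_the = counts["the"] / total_words

print(total_words)      # number of words in the sentence
print(counts["the"])    # how often "the" occurs
print(share_of_the)     # relative frequency of "the"
```

The same pattern (split, loop, count in a dictionary) reappears in later sessions whenever word frequencies are needed.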

Speaker
Salvador Ros
Salvador Ros is an Associate Professor at the School of Computer Science at UNED (National Distance Education University). He is currently the Technical Director of the POSTDATA ERC Starting Grant and the LyrAIcs proof-of-concept project, and Director of the Master of Big Data’s architectures and technologies and Data Science. He was Director of Learning Technologies at UNED for six years and Vice Dean of Technologies at the School of Computer Science for six years. He received the Extraordinary Doctoral Award from UNED for his PhD dissertation, as well as two special best paper awards. He is a strategy and innovation manager in the public sector. He completed the Leadership Program for Public Sector Management at IESE Business School, Universidad de Navarra; the Strategic Senior Management for Universities programme at Universidad de Nebrija and Politécnica de Barcelona; and the Leadership Program for Innovation and Entrepreneurship in the Public Sector at Deusto Business School, Universidad de Deusto. He has been a senior member of the IEEE Education Society since 2007. His research and professional activity focuses on enhanced learning technologies for distance learning scenarios, learning analytics, big data, AI applied to science and the humanities, and strategic consulting for the public sector.
2.Digging for Gold II: Extracting useful information from a corpus pt1 - Cleaning
In this lesson, students will learn how to use the Natural Language Toolkit (nltk) to download a corpus and then use Python and regular expressions to clean the text data. Students will learn how to remove unwanted characters and symbols, tokenise the text into individual words, and how to remove stop words. This lesson will equip students with the skills to clean and preprocess text data for further analysis and natural language processing tasks.
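
The cleaning steps described above can be sketched with the standard library alone. This is a minimal sketch, not the lesson's code: the stop word list is a tiny hand-picked assumption (the lesson relies on NLTK resources instead), and the function name `clean_and_tokenise` is hypothetical.

```python
import re

# Tiny hand-picked stop word list: an assumption for this sketch only.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "is", "it", "to", "was"}

def clean_and_tokenise(text):
    """Lowercase, strip non-letter characters, split into words, drop stop words."""
    text = text.lower()
    # Replace anything that is not an unaccented letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

sample = "It was the BEST of times, it was the worst of times!"
tokens = clean_and_tokenise(sample)
print(tokens)
```

Note that the character class `[a-z]` also discards accented letters, so a pattern like this would need adjusting before being applied to Spanish or other non-English text.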

Speakers
Salvador Ros
Álvaro Pérez
Álvaro Pérez Pozo is a computational linguist at UNED and has published work on topics such as automatic stanza classification in Spanish poetry and artificial intelligence applications for the humanities. He also has experience obtaining, cleaning, and compiling very large text collections.
3.Digging for Gold III: Extracting useful information from a corpus pt2
This is a continuation of lesson 2. You will learn how to use more advanced Python data structures, and we will introduce a Python library for downloading corpora from Project Gutenberg and show how to clean the text. Finally, students will learn what the pandas library is and how to use it to work with and process very large corpora.
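
One common cleaning step for Project Gutenberg files is removing the licence text that surrounds the actual book. The sketch below does this with regular expressions only; it does not use the library covered in the lesson. The marker patterns are an assumption (their exact wording varies between files), and `strip_gutenberg_boilerplate` is a hypothetical helper name.

```python
import re

# Marker patterns are an assumption: wording varies between Project Gutenberg files.
START_RE = re.compile(r"\*\*\* ?START OF .*?\*\*\*")
END_RE = re.compile(r"\*\*\* ?END OF .*?\*\*\*")

def strip_gutenberg_boilerplate(raw):
    """Return the text between the START and END markers, or the input if none found."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    if not start or not end or end.start() < start.end():
        return raw.strip()
    return raw[start.end():end.start()].strip()

# Made-up miniature file for illustration.
raw = (
    "Licence header text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter 1\nIt was a dark night.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Licence footer text\n"
)
body = strip_gutenberg_boilerplate(raw)
print(body)
```

Falling back to the unchanged input when no markers are found keeps the function safe to run over a mixed collection of files.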

Speakers
Álvaro Pérez
Salvador Ros
4.Digging for Gold IV: Word Embeddings
In this lesson, students will use the natural language processing library, spaCy, to extract all the “scary” verbs from a corpus of horror books. They will then use the similarity function in spaCy to determine which verbs are most closely related to the concept of fear. Additionally, they will determine which book is the “scariest” by calculating the ratio of scary verbs to total words in each book. This exercise will provide students with an understanding of how natural language processing can be used to analyse and compare different works of literature based on a specific set of criteria.
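
The "scariest book" calculation described above boils down to a ratio per book. The sketch below shows only that arithmetic in plain Python; it does not call spaCy. The verb list, the two toy "books" and the function name `scary_ratio` are all invented for illustration, and the tokens are assumed to be already lemmatised.

```python
# Hypothetical list of "scary" verbs; in the lesson these are selected with spaCy.
SCARY_VERBS = {"scream", "tremble", "shudder", "haunt"}

# Toy "books": already tokenised and lemmatised, invented for illustration.
books = {
    "book_a": ["the", "ghost", "haunt", "the", "house", "and", "we", "tremble"],
    "book_b": ["she", "walk", "to", "the", "market", "and", "buy", "bread", "and", "scream"],
}

def scary_ratio(tokens):
    """Share of tokens that appear in the scary-verb list."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in SCARY_VERBS)
    return hits / len(tokens)

ratios = {title: scary_ratio(tokens) for title, tokens in books.items()}
scariest = max(ratios, key=ratios.get)
print(ratios)
print(scariest)
```

Dividing by the total number of words, rather than comparing raw counts, keeps long and short books comparable.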

Digging for Gold IV: Word Embeddings pt 1

Digging for Gold IV: Word Embeddings pt 2
Speaker
Salvador Ros
5.Digging for Gold V: Vector Semantics and Embeddings
Embedding spaces and computational literary studies have emerged as a fruitful convergence of computer science and literary analysis. This presentation introduces the concept of embedding spaces, which represent textual data as high-dimensional vectors, capturing semantic relationships and contextual information. Specifically, word embeddings enable tasks such as sentiment analysis and authorship attribution, while sentence and document embeddings facilitate analysis at larger text units. By leveraging computational methods, researchers can delve into the complexities of literature, offering quantitative insights and novel research avenues. This interdisciplinary approach holds great promise for revolutionising the study of literature and deepening our understanding of its intricacies.
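
To make the idea of an embedding space concrete, the sketch below compares a few word vectors with cosine similarity and averages word vectors into a crude document vector. The three-dimensional vectors are invented for illustration (real embeddings have hundreds of dimensions and are learned from data), and averaging is only one simple way to build a document embedding.

```python
import math

# Invented 3-dimensional vectors; real embeddings are learned and much larger.
vectors = {
    "fear":   [0.9, 0.1, 0.0],
    "terror": [0.8, 0.2, 0.1],
    "bread":  [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(words):
    """Average the vectors of known words to get a crude document vector."""
    known = [vectors[w] for w in words if w in vectors]
    return [sum(dim) / len(known) for dim in zip(*known)]

sim_close = cosine(vectors["fear"], vectors["terror"])
sim_far = cosine(vectors["fear"], vectors["bread"])
doc = mean_vector(["fear", "terror"])
print(sim_close, sim_far, doc)
```

Words pointing in similar directions get a similarity near 1, while unrelated words score near 0, which is the property that similarity-based analyses build on.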

Speaker
Javier de la Rosa
Javier de la Rosa is Senior Research Scientist at the Artificial Intelligence Laboratory of the National Library of Norway, and former postdoc in Natural Language Processing (NLP) at UNED. He holds a PhD specialising in Digital Humanities and an MSc in Artificial Intelligence. His interests are in natural language processing applied to historical and literary texts, with a focus on large language models. He has previously worked at Stanford and the University of Western Ontario.
6.Finding Gold I: Stylometry - Distances and differences
This is an introductory overview of the field of stylometry and multivariate text analysis. We discuss classic approaches to text representation, such as bag of words, and show the ability of word frequencies to reflect meaningful cultural and social conditions of texts: genre, chronology and authorship.
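
As a rough sketch of how word frequencies turn into distances between texts, the code below computes a simplified Burrows's Delta: relative frequencies of the most frequent words, standardised as z-scores, then the mean absolute difference between two texts. It is a Python stand-in, not the stylo implementation used in the session; the toy texts are invented, and using the population standard deviation is a simplification chosen here.

```python
from collections import Counter
from statistics import mean, pstdev

# Toy tokenised texts, invented for illustration.
texts = {
    "A1": "the cat and the dog and the bird".split(),
    "A2": "the cat and the fox and the hen".split(),
    "B1": "a dog of a man of a town of a land".split(),
}

def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]

# Feature selection: the most frequent words in the whole collection.
all_counts = Counter(t for tokens in texts.values() for t in tokens)
vocab = [w for w, _ in all_counts.most_common(4)]

freqs = {name: relative_freqs(tokens, vocab) for name, tokens in texts.items()}

# Standardise each word's frequency across texts (z-scores).
columns = list(zip(*freqs.values()))
means = [mean(col) for col in columns]
sds = [pstdev(col) or 1.0 for col in columns]  # guard against zero spread
z = {
    name: [(f - m) / s for f, m, s in zip(row, means, sds)]
    for name, row in freqs.items()
}

def delta(a, b):
    """Mean absolute difference of z-scores between two texts."""
    return mean(abs(x - y) for x, y in zip(z[a], z[b]))

print(delta("A1", "A2"), delta("A1", "B1"))
```

In this toy collection A1 and A2 share the same function-word profile, so their distance is smaller than the distance between A1 and B1.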

R and RStudio

The most important thing to do is to install R and RStudio on your machine. We will use the stylo library for most of the day; if you want, you can look at the step-by-step introduction to stylo for beginners, or at the more extensive HOW TO, but we will cover all the basics. NB: if you don’t or can’t install R and stylo locally, you will be able to run the analysis from Colab; just be aware that this requires more coding, not less, because stylo’s graphical interface does not work in Colab. You are free to bring your own collection of texts to the workshop, but you can also find plain-text fiction collections in various languages on the Computational Stylistics website.

Materials

We will use a GDrive folder that holds all the necessary materials. Your options are:
- Download it and work locally
- Download the folder and re-upload it to work in Colab’s virtual machine with R (or just open the .ipynb notebook and choose File -> Save a copy in Drive, which creates a copy of the notebook on your own Drive)
To copy the folder to your own GDrive, do this:
- Download CLS_Madrid_Folder
- Extract the downloaded .zip
- Upload the extracted folder back to your GDrive
Speakers
Artjoms Šeļa
Dr. Artjoms Šeļa is a postdoctoral researcher at the Methodology department of the Institute of Polish Language (PAN, Kraków) and is a research fellow at the University of Tartu (Estonia). He holds a PhD in Russian Literature and uses computational methods to understand historical change in literature and culture. His main research interests include stylometry, verse studies and cultural evolution. Sometimes he forays into digital preservation and history of quantitative methods in humanities.
Joanna Byszuk
Joanna Byszuk is a research associate and a member of Computational Stylistics Group at the Institute of Polish Language, Polish Academy of Sciences, Kraków. She has worked on ‘Foundations of Computational Stylistics’ (2018-2022) and ‘CLS Infra’ (2022-2025) projects, focusing on cross-lingual computational stylistics and advancing stylometric methodology and its understanding, especially locating method limitations and developing evaluation procedures. She was also engaged in the COST Action Distant Reading, where she was leading Working Group 2 ‘Methods and Tools’ (2020-2022), and in ‘Deep Learning in the Computational Stylistics’ collaboration with the University of Antwerp. She is interested in discourse analysis and sociolinguistics, especially in connection to ‘big data’ and multimodal perspective, establishing in her dissertation a methodology of multimodal stylometry for the study of audiovisual works.
7.Finding Gold II: Stylometry with R
This session introduces the ‘stylo’ library for R, which lets you perform a range of stylometric analyses on a collection of documents. We briefly introduce the R language and walk through the graphical user interface of ‘stylo’ to show the practicalities of feature selection, distance metrics, cluster analysis and sampling.

Note - Download instructions for software and materials can be found in Session 5

Speakers
Maciej Eder
Prof. Maciej Eder is the director of the Institute of Polish Language at the Polish Academy of Sciences, chair of the Committee of Linguistics at the Polish Academy of Sciences, vice-chair of the COST Action ‘Distant Reading’, co-founder of the Computational Stylistics Group, and the main developer of the R package ‘Stylo’ for performing stylometric analyses. He is interested in European literature of the Renaissance and the Baroque, classical heritage in early modern literature, and quantitative approaches to style variation. These include measuring style using statistical methods, authorship attribution based on quantitative measures, as well as ‘distant reading’ methods to analyse dozens (or hundreds) of literary works at a time.
Artjoms Šeļa
Joanna Byszuk
8.Finding Gold III: Keywords and associations
In this session, we look at the idea of word ‘keyness’ and at ways to understand which features differ between texts and corpora. We also show how to trace the features that might underlie text groupings and clusters, and how to detect potential bias, or systematic error, in a corpus.
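
One widely used keyness measure is the log-likelihood statistic (G2), which compares a word's observed frequency in a target corpus and a reference corpus with the frequencies expected if the word were equally common in both. The sketch below is an illustration of that measure only; whether the session uses this particular statistic is an assumption, and the counts in the usage lines are invented.

```python
import math

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness (G2) for one word.

    a, b: occurrences of the word in the target and reference corpus
    c, d: total number of tokens in the target and reference corpus
    """
    e1 = c * (a + b) / (c + d)  # expected count in the target corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Invented counts: a word seen 50 times in one corpus and 10 times in another,
# each corpus containing 10,000 tokens.
key_score = log_likelihood(50, 10, 10_000, 10_000)
# A word equally frequent in both corpora has no keyness.
flat_score = log_likelihood(20, 20, 10_000, 10_000)
print(key_score, flat_score)
```

A higher score means the frequency difference is less likely to be due to chance; the score alone does not say in which corpus the word is over-represented, so the raw counts still need to be inspected.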

Note - Download instructions for software and materials can be found in Session 5
Speakers
Artjoms Šeļa
Joanna Byszuk
9.Finding Gold IV: Case-study - Stylometry applied to Old Spanish Poetry
In this session, we will try to find out whether the Old Spanish version of the Book of Alexander was written by the author known as Gonzalo de Berceo, as stated in one manuscript. We will gather the data from the Old Spanish Textual Archive (OSTA), a lemmatised and PoS-tagged corpus of Old Spanish texts. Then we will use the stylo package to investigate. Afterwards, we will test whether rhyming words are a good basis for establishing the authorship of the Book of Alexander.

Speaker
José Manuel Fradejas Rueda
José Manuel Fradejas Rueda is a Professor of Spanish Language at the Universidad de Valladolid. Prof. Fradejas Rueda has been engaged with computational text analysis (including text editing, mostly of medieval sources) since the late 1980s; he has recently become more interested in these digital approaches after discovering R and how useful it can be for a literary or linguistic scholar.
10.Showing Gold I: Visualisation
This session aims to introduce visualisation grammars and their applications to accelerate exploratory data analysis in digital humanities projects by allowing users to create interactive visualisations with minimal effort in a Jupyter notebook. To this end, a real application will be built by replicating the design process of a visualisation system based on machine-annotated textual data, in a similar setup to many digital humanities research projects.
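
A visualisation grammar describes a chart declaratively: data, a mark type, and encodings that map data fields to visual channels. The sketch below builds a small Vega-Lite specification as a Python dictionary and serialises it to JSON. Choosing Vega-Lite is an assumption made for this example (the session may use a different grammar or a wrapper library), and the annotation records are invented.

```python
import json

# Invented machine-annotated records: entity counts per document.
records = [
    {"document": "doc1", "entity_type": "PERSON", "count": 12},
    {"document": "doc1", "entity_type": "PLACE", "count": 5},
    {"document": "doc2", "entity_type": "PERSON", "count": 7},
    {"document": "doc2", "entity_type": "PLACE", "count": 9},
]

# Declarative chart description: what to draw, not how to draw it.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": records},
    "mark": "bar",
    "encoding": {
        "x": {"field": "document", "type": "nominal"},
        "y": {"field": "count", "type": "quantitative"},
        "color": {"field": "entity_type", "type": "nominal"},
    },
}

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Changing the chart then means editing the description, for example swapping `"bar"` for `"point"`, rather than rewriting drawing code.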

You can download the course materials for this session here.

Speaker
Alejandro Benito-Santos
Alejandro Benito-Santos is a postdoctoral researcher in the School of Computer Science, UNED, working at the intersection of textual analysis, digital humanities, information visualisation and HCI. His background includes working with unstructured and semistructured text and he designed an interactive visual analytics system that allowed users to navigate a historical dictionary given in TEI format as part of his postgraduate degree.
11.Storing Gold II: Lindat/Teitok
Corpus data should be kept FAIR: findable, accessible, interoperable and reusable. This lecture will show how to do that using an example set-up in which the LINDAT repository is used for findability (as well as long-term preservation), TEITOK for accessibility (as well as maintenance and annotation), and TEI for interoperability and reusability. We will also show how to interact with these tools and standards both via a GUI (online) and via an API (programmatically).
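
Part of what makes TEI interoperable is that any XML-aware tool can read it. The sketch below parses a tiny TEI-style fragment with Python's standard library and extracts the title and the annotated words. The fragment is invented and simplified (it is not schema-valid TEI, and real TEITOK files may be structured differently); it does not show the LINDAT or TEITOK APIs.

```python
import xml.etree.ElementTree as ET

# The TEI namespace, needed to address elements in a namespaced document.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Minimal TEI-style fragment, invented and simplified for illustration.
tei_xml = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample text</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text><body><p>
    <w lemma="dig" pos="VERB">Digging</w>
    <w lemma="for" pos="ADP">for</w>
    <w lemma="gold" pos="NOUN">gold</w>
  </p></body></text>
</TEI>"""

root = ET.fromstring(tei_xml)

# Metadata lives in the header, the annotated text in the body.
title = root.find(".//tei:titleStmt/tei:title", TEI_NS).text
tokens = [
    (w.text, w.get("lemma"), w.get("pos"))
    for w in root.iterfind(".//tei:w", TEI_NS)
]
print(title)
print(tokens)
```

Because the word form, lemma and part of speech are stored as explicit markup, the same file can feed a search interface, a frequency count or a stylometric analysis without re-annotation.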

Speaker
Maarten Janssen
With a background in computational linguistics, Maarten has been involved in many corpus projects. Over the course of time he has developed the TEITOK environment, which is intended to allow linguists to build, maintain, and improve their own corpus without the need for extensive computational skills. Maarten is currently employed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, at Charles University in Prague.
