Digging for Gold - Knowledge Extraction from Text

About the Local Host - UNED

UNED (the National University of Distance Education) is a distance-learning university and the largest university in Spain by number of students enrolled. Its mission is the public service of higher education through distance education. To fulfil this mission, it facilitates access to university education, and the continuation of their studies, for all people qualified to pursue higher studies who choose the UNED educational system for its methodology or for reasons of work, finances, residence or any other.

UNED is the public university with the largest number of students in the country, more than 250,000. It has more than 40 years of experience making the principle of equal opportunities in access to higher education a reality, thanks to a methodology based on the principles of distance learning and focused on the needs of the student. It is a leader in applying cutting-edge technologies to learning, with the largest offer of virtual courses in the country. It has a very wide training offer:

  • 27 degrees adapted to the European Higher Education Area (EHEA)
  • 65 official university Master’s degrees adapted to the EHEA.
  • 18 PhD programmes adapted to the EHEA.
  • 4 diplomas and 5 technical engineering degrees (all being phased out).
  • 18 degrees and 3 engineering degrees (all being phased out).
  • University Access Courses for over 25s and over 45s (CAD).
  • University Access Tests for over 25s and over 45s (CAD).
  • Access to the University by professional or work experience (for people over 40 years of age)
  • 13 languages are offered at the Distance Learning University Centre (CUID) – foreign languages, Spanish for foreigners and co-official languages.
  • Permanent Training (more than 600 courses).
  • University extension courses (more than 350 courses).
  • Summer courses (more than 140 courses).

UNED Senior is designed for people over 55 years of age, regardless of their academic training, with the aim of expanding knowledge on current issues.

UNED Open is a new channel created to facilitate the search for open educational content from UNED.

UNED has more than 1,400 professors in 9 Faculties, 2 Technical Schools and CAD at the Headquarters, and more than 6,500 tutor professors distributed across the Associated Centres.

It is present in all the Autonomous Communities through 61 Associated Centres, more than 100 extensions and classrooms, and 28 zone centres in the Community of Madrid, where tutorials are given and face-to-face tests prepared by the teaching teams are carried out.

UNED has a presence in 11 countries in Europe, America and Africa through 12 UNED Centres abroad and 4 concerted centres for exams. It is especially concerned with responding to the training demands of groups with special needs (people with disabilities, immigrants, and the reintegration of the prison population).


Session 1 Python

To get the most out of this session you will need to have a few things downloaded to your machine.

Anaconda: install Anaconda locally (it includes Jupyter Notebook) from https://www.anaconda.com/ (download).

Google Colaboratory (Colab): you will use Google Colab to access files in your browser: https://colab.research.google.com/notebooks/intro.ipynb?hl=en

You will download files from this GitHub repository: https://github.com/sros-UNED/NLPforHumanist (Code -> Download ZIP).

Session 5-8 Preparation Notes

The most important technical requirement for Day 2 is installing R and (preferably) RStudio, a development environment for R that will make things easier. We will use the stylo library for most of the day; see the step-by-step introduction to stylo for beginners, or the more extensive HOW TO.

If you don’t or can’t install R and stylo locally, there will be an option to run the analysis from Colaboratory notebooks. Just be aware that this will require more coding, not less, because stylo’s interface doesn’t work in Colab. You are free to bring your own collection of texts to the workshop, but you can also find ready-to-use plain text fiction collections in various languages on the Computational Stylistics website.

Materials

The GDrive folder (https://drive.google.com/drive/folders/1V8KFkDxGpKvYqjbkymqukH1QdjwTJeak?usp=sharing) holds all the necessary materials.

Your options are:

  • Download it and work locally.
  • Download -> re-upload the folder to work in Colab’s virtual machine with R (or just open the .ipynb notebook, then File -> Save a copy in Drive, which creates a copy of the notebook on your own Drive).

To copy the folder to your own GDrive:

  • Download CLS_Madrid_Folder
  • Extract the downloaded .zip
  • Upload the extracted folder back to your GDrive

  1. Digging for Gold I: Introduction to Python

    This lesson provides an introduction to variables, operators, loops, lists, and dictionaries. Students will learn how to use lists and dictionaries to manipulate data in their programs. By the end of the lesson, students will have a foundational understanding of Python’s core concepts.
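As a minimal sketch of the kind of list-and-dictionary manipulation the description above mentions, the following counts how often each word occurs in a list. The sample data and the function name `count_words` are hypothetical illustrations, not taken from the course materials.

```python
# Hypothetical sample data: a short list of words to count.
words = ["gold", "text", "gold", "corpus", "text", "gold"]


def count_words(tokens):
    """Return a dictionary mapping each token to the number of times it occurs."""
    counts = {}
    for token in tokens:  # loop over every item in the list
        # .get() returns 0 the first time a token is seen
        counts[token] = counts.get(token, 0) + 1
    return counts


word_counts = count_words(words)
print(word_counts)
```

The loop visits each list item once and uses the dictionary as a running tally, which is a common first pattern when working with text in Python.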

    Digging for Gold I: Introduction to Python

    Speaker
    • Salvador Ros

      Salvador Ros is an Associate Professor at the School of Computer Science at UNED (National Distance Education University). He is currently the Technical Director of the POSTDATA ERC Starting Grant and the LyrAIcs proof-of-concept project, and Director of the Master of Big Data’s architectures and technologies and Data Science. He was Director of Learning Technologies at UNED for six years and Vice Dean of Technologies at the Computer Science School for six years. He received the Extraordinary Doctoral Award at UNED for his PhD dissertation, as well as two special best paper awards. He is a strategy and innovation manager in the public sector. He graduated from the Leadership Program for Public Sector Management at IESE Business School, Universidad de Navarra; the Strategic Senior Management for Universities programme at Universidad de Nebrija and Politécnica de Barcelona; and the Leadership Program for Innovation and Entrepreneurship in Public Sector at Deusto Business School, Universidad de Deusto. He has been a senior member of the IEEE Education Society since 2007. His research and professional activity focuses on enhanced learning technologies for distance learning scenarios, learning analytics, big data, and AI applied to Science and Humanities, as well as strategic consulting for the public sector.

  2. Digging for Gold II: Extracting useful information from a corpus pt1 - Cleaning

    In this lesson, students will learn how to use the Natural Language Toolkit (nltk) to download a corpus and then use Python and regular expressions to clean the text data. Students will learn how to remove unwanted characters and symbols, tokenise the text into individual words, and how to remove stop words. This lesson will equip students with the skills to clean and preprocess text data for further analysis and natural language processing tasks.
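The cleaning steps named above (removing unwanted characters, tokenising, removing stop words) can be sketched with the standard library alone. This is only an illustration: the stop-word list is a tiny hypothetical one, whereas the lesson uses nltk, and the function name `clean_and_tokenise` is invented for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list; nltk's lists are far more complete.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "was", "it"}


def clean_and_tokenise(text):
    """Lower-case the text, strip unwanted characters, split into words, drop stop words."""
    text = text.lower()
    # Replace everything that is not a-z or whitespace with a space.
    # Note: this also strips accented letters, so it only suits plain English text.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()  # simple whitespace tokenisation
    return [token for token in tokens if token not in STOP_WORDS]


sample = "It was the best of times, it was the worst of times!"
tokens = clean_and_tokenise(sample)
print(tokens)
```

For texts in other languages, the character class in the regular expression would need to be widened so that accented letters survive the cleaning step.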

    Digging for Gold II: Extracting useful information from a corpus pt1 - Cleaning

    Speakers
    • Salvador Ros

    • Alvaro Pérez

      Álvaro Pérez Pozo is a computational linguist at UNED and has published work on topics such as automatic stanza classification in Spanish poetry and artificial intelligence applications for the humanities. He also has experience obtaining, cleaning, and compiling very large text collections.

  3. Digging for Gold III: Extracting useful information from a corpus pt2

    This is a continuation of lesson 2, in which you will learn how to use more advanced Python data structures. We will introduce the Project Gutenberg Python library for downloading corpora and show how to clean the text. Finally, students will learn what the pandas library is and how to use it to work with and process very large corpora.

    Digging for Gold III: Extracting useful information from a corpus pt2

    Speakers
    • Alvaro Pérez

    • Salvador Ros

  4. Digging for Gold IV: Word Embeddings

    In this lesson, students will use the natural language processing library, spaCy, to extract all the “scary” verbs from a corpus of horror books. They will then use the similarity function in spaCy to determine which verbs are most closely related to the concept of fear. Additionally, they will determine which book is the “scariest” by calculating the ratio of scary verbs to total words in each book. This exercise will provide students with an understanding of how natural language processing can be used to analyse and compare different works of literature based on a specific set of criteria.
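The "scariest book" calculation described above, the ratio of scary verbs to total words, can be sketched as follows. In the lesson the verbs would be identified with spaCy; here the verb set, the two toy "books" and the function name `scary_ratio` are all hypothetical and written out by hand so that the example runs without any model download.

```python
# Hypothetical inputs: a hand-picked verb set and two toy "books" as token lists.
SCARY_VERBS = {"scream", "tremble", "haunt", "shudder"}

books = {
    "Book A": "the ghosts haunt the hall and the guests scream and tremble".split(),
    "Book B": "the sun rose and the children played in the garden".split(),
}


def scary_ratio(tokens, scary_verbs):
    """Share of tokens that belong to the set of scary verbs."""
    if not tokens:  # avoid dividing by zero for an empty book
        return 0.0
    hits = sum(1 for token in tokens if token in scary_verbs)
    return hits / len(tokens)


ratios = {title: scary_ratio(tokens, SCARY_VERBS) for title, tokens in books.items()}
scariest = max(ratios, key=ratios.get)  # title with the highest ratio
print(ratios, scariest)
```

Dividing by the total number of words keeps long and short books comparable, which a raw count of scary verbs would not.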

    Digging for Gold IV: Word Embeddings pt 1

    Digging for Gold IV: Word Embeddings pt 2

    Speaker
    • Salvador Ros

  5. Digging for Gold V: Vector Semantics and Embeddings

    Embedding spaces and computational literary studies have emerged as a fruitful convergence of computer science and literary analysis. This presentation introduces the concept of embedding spaces, which represent textual data as high-dimensional vectors, capturing semantic relationships and contextual information. Specifically, word embeddings enable tasks such as sentiment analysis and authorship attribution, while sentence and document embeddings facilitate analysis at larger text units. By leveraging computational methods, researchers can delve into the complexities of literature, offering quantitative insights and novel research avenues. This interdisciplinary approach holds great promise for revolutionising the study of literature and deepening our understanding of its intricacies.
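To make the idea of words as vectors concrete, here is a small sketch that compares vectors with cosine similarity, one common measure for embeddings. The session description does not say which measure is used, so this choice is an assumption, and the three-dimensional vectors are invented toy values rather than real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings"; real models use hundreds of dimensions.
toy_vectors = {
    "fear": [0.9, 0.1, 0.0],
    "dread": [0.8, 0.2, 0.1],
    "garden": [0.0, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors["fear"], toy_vectors["dread"])
sim_far = cosine_similarity(toy_vectors["fear"], toy_vectors["garden"])
print(sim_close, sim_far)
```

With these toy values, "fear" and "dread" point in nearly the same direction and score close to 1, while "fear" and "garden" score close to 0.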

    Digging for Gold V: Vector Semantics and Embeddings

    Speaker
    • Javier de la Rosa

      Javier de la Rosa is Senior Research Scientist at the Artificial Intelligence Laboratory of the National Library of Norway, and former postdoc in Natural Language Processing (NLP) at UNED. He holds a PhD specialising in Digital Humanities and an MSc in Artificial Intelligence. His interests are in natural language processing applied to historical and literary texts, with a focus on large language models. He has previously worked at Stanford and the University of Western Ontario.

  6. Finding Gold I: Stylometry - Distances and differences

    This is an introductory overview of the field of stylometry and multivariate text analysis. We discuss classic approaches to text representation, such as bag of words, and show the ability of word frequencies to reflect meaningful cultural and social conditions of texts: genre, chronology and authorship.
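The bag-of-words representation mentioned above can be sketched in a few lines: each text becomes a table of relative word frequencies, and two texts are compared by a distance between those tables. The workshop itself uses the stylo library in R; this Python sketch, the choice of Manhattan distance, and the three toy texts are assumptions made purely for illustration.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Bag of words: map each word to its share of all tokens in the text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def manhattan_distance(freq_a, freq_b):
    """Sum of absolute frequency differences over the combined vocabulary."""
    vocabulary = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) for w in vocabulary)


# Hypothetical toy texts, already tokenised.
text_a = "the cat and the dog".split()
text_b = "the cat and the bird".split()
text_c = "a storm broke over a dark sea".split()

freq_a, freq_b, freq_c = (relative_frequencies(t) for t in (text_a, text_b, text_c))
d_ab = manhattan_distance(freq_a, freq_b)
d_ac = manhattan_distance(freq_a, freq_c)
print(d_ab, d_ac)
```

Texts that share most of their frequent words end up close together, while texts with no shared vocabulary end up at the maximum distance of 2.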

    R and RStudio

    The most important thing to do is to install R and RStudio on your machine. We will use the stylo library for most of the day; if you want, you can look at the step-by-step introduction to stylo for beginners, or at the more extensive HOW TO, but we will cover all the basics. NB: if you don’t or can’t install R and stylo locally, there will be an option to run the analysis from Colab. Just be aware that this will require more coding, not less, because stylo’s interface doesn’t work in Colab. You are free to bring your own collection of texts to the workshop, but you can also find plain text fiction collections in various languages on the Computational Stylistics website.

    Materials

    We will use a GDrive folder that holds all the necessary materials. Your options are:

    • Download it and work locally
    • Download -> re-upload the folder to work in Colab’s virtual machine with R (or just open the .ipynb notebook, then File -> Save a copy in Drive, which creates a copy of the notebook on your own Drive).

    To copy the folder to your own GDrive, do this:

    • Download CLS_Madrid_Folder
    • Extract the downloaded .zip
    • Upload the extracted folder back to your GDrive

    Finding Gold I: Stylometry - Distances and differences

    Speakers
    • Artjoms Šeļa

      Dr. Artjoms Šeļa is a postdoctoral researcher at the Methodology department of the Institute of Polish Language (PAN, Kraków) and is a research fellow at the University of Tartu (Estonia). He holds a PhD in Russian Literature and uses computational methods to understand historical change in literature and culture. His main research interests include stylometry, verse studies and cultural evolution. Sometimes he forays into digital preservation and history of quantitative methods in humanities.

    • Joanna Byszuk

      Joanna Byszuk is a research associate and a member of Computational Stylistics Group at the Institute of Polish Language, Polish Academy of Sciences, Kraków. She has worked on ‘Foundations of Computational Stylistics’ (2018-2022) and ‘CLS Infra’ (2022-2025) projects, focusing on cross-lingual computational stylistics and advancing stylometric methodology and its understanding, especially locating method limitations and developing evaluation procedures. She was also engaged in the COST Action Distant Reading, where she was leading Working Group 2 ‘Methods and Tools’ (2020-2022), and in ‘Deep Learning in the Computational Stylistics’ collaboration with the University of Antwerp. She is interested in discourse analysis and sociolinguistics, especially in connection to ‘big data’ and multimodal perspective, establishing in her dissertation a methodology of multimodal stylometry for the study of audiovisual works.

  7. Finding Gold II: Stylometry with R

    This session introduces the ‘stylo’ library for R, which allows users to perform different stylometric analyses on a collection of documents. We briefly introduce the R language and go over the graphical user interface of ‘stylo’ to show the practicalities of feature selection, distance metrics, cluster analysis and sampling.

    Note - Download instructions for software and materials can be found in Session 5

    Finding Gold II: Stylometry with R

    Speakers
    • Maciej Eder

      Prof. Maciej Eder is the director of the Institute of Polish Language at the Polish Academy of Sciences, chair of the Committee of Linguistics at the Polish Academy of Sciences, vice-chair of the COST Action ‘Distant Reading’, co-founder of the Computational Stylistics Group, and the main developer of the R package ‘Stylo’ for performing stylometric analyses. He is interested in European literature of the Renaissance and the Baroque, classical heritage in early modern literature, and quantitative approaches to style variation. These include measuring style using statistical methods, authorship attribution based on quantitative measures, as well as ‘distant reading’ methods to analyse dozens (or hundreds) of literary works at a time.

    • Artjoms Šeļa

    • Joanna Byszuk

  8. Finding Gold III: Keywords and associations

    In this session, we look at the idea of word ‘keyness’ and at ways to understand which features differ between texts and corpora. We also show how to trace features that might underlie text groupings and clusters, and how to detect potential bias, or systematic error, in a corpus.
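One simple way to illustrate keyness is a smoothed log ratio: how much more frequent a word is in a target corpus than in a reference corpus. The session description does not name a specific keyness measure, so the log ratio, the smoothing constant and the function name `log_ratio_keyness` are assumptions for this sketch, and the two corpora are toy data.

```python
import math
from collections import Counter


def log_ratio_keyness(target_tokens, reference_tokens, smoothing=0.5):
    """Log2 ratio of smoothed relative frequencies; positive means typical of the target."""
    target, reference = Counter(target_tokens), Counter(reference_tokens)
    n_target, n_reference = len(target_tokens), len(reference_tokens)
    scores = {}
    for word in set(target) | set(reference):
        # Smoothing keeps the ratio defined when a word is absent from one corpus.
        rate_target = (target[word] + smoothing) / (n_target + smoothing)
        rate_reference = (reference[word] + smoothing) / (n_reference + smoothing)
        scores[word] = math.log2(rate_target / rate_reference)
    return scores


# Hypothetical toy corpora, already tokenised.
target_corpus = "ghost ghost night the the".split()
reference_corpus = "garden sun the the the".split()

scores = log_ratio_keyness(target_corpus, reference_corpus)
top_keyword = max(scores, key=scores.get)
print(top_keyword, scores)
```

Words with strongly positive scores characterise the target corpus, words with strongly negative scores characterise the reference corpus, and words near zero are used at similar rates in both.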

    Finding Gold III: Keywords and associations

    Note - Download instructions for software and materials can be found in Session 5

    Speakers
    • Artjoms Šeļa

    • Joanna Byszuk

  9. Finding Gold IV: Case-study - Stylometry applied to Old Spanish Poetry

    In this session, we will try to find out whether the Old Spanish version of the Book of Alexander was written by an author known as Gonzalo de Berceo, as one manuscript states. We will gather the data from the Old Spanish Textual Archive (OSTA), a lemmatised and PoS-tagged corpus of Old Spanish texts, and then use the stylo package to investigate the question. Afterwards, we will try to see whether the rhyming words are a good basis for establishing the authorship of the Book of Alexander.

    Finding Gold IV: Case-study - Stylometry applied to Old Spanish Poetry

    Speaker
    • José Manuel Fradejas Rueda

      José Manuel Fradejas Rueda is a Professor of Spanish Language at the Universidad de Valladolid. Prof. Fradejas Rueda has been engaged with computational text analysis (including text editing, mostly of Medieval sources) since the late 1980s. He has recently become more interested in these digital approaches after discovering R and how useful it can be for a literary or linguistic scholar.

  10. Showing Gold I: Visualisation

    This session aims to introduce visualisation grammars and their applications to accelerate exploratory data analysis in digital humanities projects by allowing users to create interactive visualisations with minimal effort in a Jupyter notebook. To this end, a real application will be built by replicating the design process of a visualisation system based on machine-annotated textual data, in a similar setup to many digital humanities research projects.

    You can download the course materials for this session here.

    Showing Gold I: Visualisation

    Speaker
    • Alejandro Benito-Santos

      Alejandro Benito-Santos is a postdoctoral researcher in the School of Computer Science, UNED, working at the intersection of textual analysis, digital humanities, information visualisation and HCI. His background includes working with unstructured and semistructured text and he designed an interactive visual analytics system that allowed users to navigate a historical dictionary given in TEI format as part of his postgraduate degree.

  11. Storing Gold II: Lindat/Teitok

    Corpus data should be kept FAIR: findable, accessible, interoperable and reusable. This lecture will show how to do that using an example set-up in which the LINDAT repository is used for findability (as well as long-term preservation), TEITOK for accessibility (as well as maintenance and annotation), and TEI for interoperability and reusability. We will also show how to interact with these tools and standards both via a GUI (online) and via an API (programmatically).

    Storing Gold II: Lindat/Teitok

    Speaker
    • Maarten Janssen

      With a background in computational linguistics, Maarten has been involved in many corpus projects. Over the course of time he has developed the TEITOK environment, which is intended to allow linguists to build, maintain, and improve their own corpus without the need for extensive computational skills. Maarten is currently employed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, at Charles University in Prague.

Cite as

Guillermo Marco Remon, Alvaro Pérez, Artjoms Šeļa, Maciej Eder, José Manuel Fradejas Rueda, Alejandro Benito-Santos, Maarten Janssen, Vicky Garnett, Salvador Ros, Sarah Hoover, Justin Tonra, Joanna Byszuk, Bartłomiej Kunda, Anna Dijsktra, Lisanne van Rossum and Ciara L Murphy (2024). Digging for Gold - Knowledge Extraction from Text. Version 1.0.0. DARIAH Campus [Event]. https://campus.dariah.eu/resources/events/digging-for-gold-knowledge-extraction-from-text

Reuse conditions

Resources hosted on DARIAH-Campus are subject to the DARIAH-Campus Training Materials Reuse Charter.

Full metadata

Title: Digging for Gold - Knowledge Extraction from Text
Authors: Guillermo Marco Remon, Alvaro Pérez, Artjoms Šeļa, Maciej Eder, José Manuel Fradejas Rueda, Alejandro Benito-Santos, Maarten Janssen, Vicky Garnett, Salvador Ros, Sarah Hoover, Justin Tonra, Joanna Byszuk, Bartłomiej Kunda, Anna Dijsktra, Lisanne van Rossum, Ciara L Murphy
Domain: Social Sciences and Humanities
Language: en-GB
Published to DARIAH-Campus: 06/02/2024
Content type: Event
License: CC BY 4.0
Sources: DARIAH
Topics: Editing tools, Data visualisation, Scholarly publishing
Version: 1.0.0