Digging for Gold - Knowledge Extraction from Text

About the Local Host - UNED
UNED is a distance-learning university and the largest university in Spain by number of students enrolled. The mission of the National University of Distance Education (UNED) is the public service of higher education through distance learning. To fulfil this mission, it facilitates access to university education, and the continuity of studies, for everyone qualified to pursue higher studies who chooses the UNED educational system for its methodology or for work, financial, residence or any other reasons.
UNED is the public university with the largest number of students in the country, more than 250,000. It has more than 40 years of experience making the principle of equal opportunities in access to higher education a reality, thanks to a methodology based on the principles of distance learning and focused on the needs of the student. It is a leader in applying cutting-edge technologies to learning, with the largest offer of virtual courses in the country. It has a very wide range of courses:
- 27 degrees adapted to the European Higher Education Area (EHEA)
- 65 official university Master’s degrees adapted to the EHEA.
- 18 PhD programmes adapted to the EHEA.
- 4 diplomas and 5 technical engineering degrees (all being phased out).
- 18 degrees and 3 engineering degrees (all being phased out).
- University Access Courses for over 25s and over 45s (CAD).
- University Access Tests for over 25s and over 45s (CAD).
- Access to the University by professional or work experience (for people over 40 years of age)
- 13 languages are offered at the Distance Learning University Centre (CUID) – foreign languages, Spanish for foreigners and co-official languages.
- Permanent Training (more than 600 courses).
- University extension courses (more than 350 courses).
- Summer courses (more than 140 courses).
UNED Senior is designed for people over 55 years of age, regardless of their academic training, with the aim of expanding knowledge on current issues.
UNED Open is a new channel created to facilitate the search for open educational content from UNED.
It has more than 1,400 professors in 9 Faculties, 2 Technical Schools and CAD at the Headquarters, and more than 6,500 tutor professors distributed across the Associated Centres.
It is present in all the Autonomous Communities through 61 Associated Centres, more than 100 extensions and classrooms, and 28 zone centres in the Community of Madrid, where tutorials are given and face-to-face tests prepared by the teaching teams are held.
UNED has a presence in 11 countries in Europe, America and Africa through 12 UNED Centres abroad and 4 partner centres for exams. It is especially concerned with responding to the training needs of groups with special needs (people with disabilities, immigrants, reintegration of the prison population).
Session 1 Python
To get the most out of this session you will need to have a few things downloaded to your machine.
Anaconda: Install Anaconda locally (it includes Jupyter Notebook): https://www.anaconda.com/ (download)
Google Colaboratory (Colab): You will use Colab to access files in your browser: https://colab.research.google.com/notebooks/intro.ipynb?hl=en
Download the files from this GitHub repository: https://github.com/sros-UNED/NLPforHumanist (Code -> Download ZIP)
Session 5-8 Preparation Notes
The most important technical requirement for Day 2 is installing R and (preferably) RStudio (a development environment for R that will make things easier). We will use the stylo library for most of the day; see the step-by-step introduction to stylo for beginners, or the more extensive HOW TO.
In case you don’t / can’t install R and stylo locally, there will be an option to run the analysis from Colaboratory notebooks; just be aware that it will require more coding, not less, because stylo’s interface doesn’t work in Colab. You are free to bring your own collection of texts to the workshop, but you can find more ready-to-use plain-text fiction collections in various languages on the Computational Stylistics website.
Materials
We will use a GDrive folder (https://drive.google.com/drive/folders/1V8KFkDxGpKvYqjbkymqukH1QdjwTJeak?usp=sharing) that holds all necessary materials.
Your options are:
- Download it and work locally.
- Download -> re-upload the folder to work in Colab’s virtual machine with R (or just open the .ipynb notebook, then File -> Save a copy in Drive; this creates a copy of the notebook on your own Drive).
To copy the folder to your own GDrive, do this:
- Download CLS_Madrid_Folder
- Extract the downloaded .zip
- Upload the extracted folder back to your GDrive
1. Digging for Gold I: Introduction to Python
This lesson provides an introduction to variables, operators, loops, lists, and dictionaries. Students will learn how to use lists and dictionaries to manipulate data in their programs. By the end of the lesson, students will have a foundational understanding of Python’s core concepts.
Digging for Gold I: Introduction to Python
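As a rough preview of the concepts listed for this lesson, the following minimal sketch combines variables, operators, a loop, a list and a dictionary. The title, year and word list are illustrative values chosen for this example, not material from the lesson itself.

```python
# Variables hold values; the names and values here are purely illustrative.
title = "Frankenstein"
year_published = 1818

# Operators combine values: here, subtraction gives the book's age in 2023.
age_in_2023 = 2023 - year_published

# A list keeps items in order and may contain repeats.
words = ["the", "monster", "saw", "the", "moon"]

# A dictionary maps keys to values; here each word maps to its count.
counts = {}
for word in words:  # a for loop visits every item of the list in turn
    counts[word] = counts.get(word, 0) + 1

print(title, "was", age_in_2023, "years old in 2023")
print(counts)
```

The dictionary built in the loop is the same word-count structure that later lessons rely on when working with a whole corpus.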
2. Digging for Gold II: Extracting useful information from a corpus pt1 - Cleaning
In this lesson, students will learn how to use the Natural Language Toolkit (nltk) to download a corpus and then use Python and regular expressions to clean the text data. Students will learn how to remove unwanted characters and symbols, tokenise the text into individual words, and how to remove stop words. This lesson will equip students with the skills to clean and preprocess text data for further analysis and natural language processing tasks.
Digging for Gold II: Extracting useful information from a corpus pt1 - Cleaning
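The cleaning steps named in this lesson (removing unwanted characters, tokenising, removing stop words) can be sketched with the standard library alone. This is only an illustration: the lesson itself uses nltk, whereas the stop-word list, the function name clean_and_tokenise and the sample sentence below are assumptions made for this example.

```python
import re

# Tiny hand-written stop-word list, for illustration only (an assumption);
# a real project would use a fuller list, such as the ones shipped with nltk.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "was"}


def clean_and_tokenise(text):
    """Lowercase, strip non-letter characters, split into words, drop stop words."""
    text = text.lower()
    # Replace everything that is not an ASCII letter or whitespace with a space.
    # Accented letters would need a wider pattern than this one.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [token for token in tokens if token not in STOP_WORDS]


sample = "It was a dark -- and STORMY -- night; the rain fell in torrents!"
print(clean_and_tokenise(sample))
```

Replacing unwanted characters with a space, instead of deleting them, keeps words apart when punctuation is the only thing separating them.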
3. Digging for Gold III: Extracting useful information from a corpus pt2
This is a continuation of lesson 2, in which you will learn to use more advanced Python data structures. We will introduce a Project Gutenberg Python library for downloading corpora and show how to clean the text. Finally, students will learn what the pandas library is and how to use it for working with and processing very large corpora.
Digging for Gold III: Extracting useful information from a corpus pt2
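One cleaning step that typically comes up with Project Gutenberg plain-text files is removing the licence header and footer around the actual book. The sketch below assumes the commonly seen "*** START OF ..." and "*** END OF ..." marker lines; the exact wording varies between files, and the function name and sample text are hypothetical. It does not use the library introduced in the lesson.

```python
import re

# Marker patterns are an assumption based on common Project Gutenberg files.
START_RE = re.compile(r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*")
END_RE = re.compile(r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*")


def strip_gutenberg_boilerplate(raw):
    """Return the text between the START and END marker lines, if both are found."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    if start and end and start.end() < end.start():
        return raw[start.end():end.start()].strip()
    return raw.strip()  # markers not found: leave the text as it is


# Hypothetical miniature file, standing in for a downloaded book.
raw = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter 1\n"
    "It was a quiet morning.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Licence text follows here.\n"
)
print(strip_gutenberg_boilerplate(raw))
```

Falling back to the unchanged text when no markers are found avoids silently discarding a book whose header uses different wording.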
4. Digging for Gold IV: Word Embeddings
In this lesson, students will use the natural language processing library, spaCy, to extract all the “scary” verbs from a corpus of horror books. They will then use the similarity function in spaCy to determine which verbs are most closely related to the concept of fear. Additionally, they will determine which book is the “scariest” by calculating the ratio of scary verbs to total words in each book. This exercise will provide students with an understanding of how natural language processing can be used to analyse and compare different works of literature based on a specific set of criteria.
Digging for Gold IV: Word Embeddings pt 1
Digging for Gold IV: Word Embeddings pt 2
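The two calculations described for this lesson, selecting verbs by similarity to "fear" and then ranking books by the ratio of scary verbs to total words, can be sketched as follows. The three-dimensional vectors, the 0.8 threshold and the two toy "books" are invented for illustration; in the lesson the vectors and the verb tags come from spaCy, and the tokens are assumed here to be already lemmatised.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors; real word vectors have hundreds of dimensions.
VECTORS = {
    "fear": [0.9, 0.1, 0.0],
    "scream": [0.8, 0.2, 0.1],
    "tremble": [0.7, 0.3, 0.0],
    "walk": [0.1, 0.9, 0.2],
    "eat": [0.0, 0.8, 0.5],
}


def scary_verbs(verbs, threshold=0.8):
    """Keep the verbs whose vector is close enough to the vector for 'fear'."""
    return {v for v in verbs if cosine(VECTORS[v], VECTORS["fear"]) >= threshold}


def scariness(tokens, scary):
    """Ratio of scary verbs to the total number of tokens in a book."""
    return sum(1 for t in tokens if t in scary) / len(tokens)


scary = scary_verbs(["scream", "tremble", "walk", "eat"])

# Toy "books" as lists of lemmas (an assumption made for this sketch).
books = {
    "Book A": ["she", "scream", "and", "tremble", "then", "walk"],
    "Book B": ["they", "walk", "and", "eat", "then", "scream", "again", "quietly"],
}
ratios = {name: scariness(tokens, scary) for name, tokens in books.items()}
scariest = max(ratios, key=ratios.get)
print(sorted(scary), ratios, scariest)
```

Dividing by the total number of words keeps a long book from looking scarier than a short one simply because it contains more text.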
5. Digging for Gold V: Vector Semantics and Embeddings
Embedding spaces and computational literary studies have emerged as a fruitful convergence of computer science and literary analysis. This presentation introduces the concept of embedding spaces, which represent textual data as high-dimensional vectors, capturing semantic relationships and contextual information. Specifically, word embeddings enable tasks such as sentiment analysis and authorship attribution, while sentence and document embeddings facilitate analysis at larger text units. By leveraging computational methods, researchers can delve into the complexities of literature, offering quantitative insights and novel research avenues. This interdisciplinary approach holds great promise for revolutionising the study of literature and deepening our understanding of its intricacies.
Digging for Gold V: Vector Semantics and Embeddings
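One simple way to move from word embeddings to the document embeddings mentioned in this abstract is to average the word vectors of a text (mean pooling). The sketch below assumes that approach; the presentation may use other methods. The two-dimensional vectors and the token lists are hypothetical.

```python
import math

# Hypothetical two-dimensional word vectors, for illustration only.
WORD_VECTORS = {
    "ghost": [0.9, 0.1],
    "night": [0.7, 0.3],
    "garden": [0.2, 0.8],
    "flower": [0.1, 0.9],
}


def document_vector(tokens):
    """Mean-pool the vectors of the known tokens into one document vector."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        raise ValueError("no token has a vector")
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


gothic = document_vector(["ghost", "night"])
pastoral = document_vector(["garden", "flower"])
# "the" has no vector in this toy vocabulary and is skipped.
query = document_vector(["night", "ghost", "the"])

print(cosine(query, gothic), cosine(query, pastoral))
```

Averaging ignores word order, so it is best read as a baseline against which richer sentence and document embeddings can be compared.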
6. Finding Gold I: Stylometry - Distances and differences
This is an introductory overview of the field of stylometry and multivariate text analysis. We discuss classic approaches to text representation, such as bag of words, and show the ability of word frequencies to reflect meaningful cultural and social conditions of texts: genre, chronology and authorship.
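To make the bag-of-words idea concrete, the sketch below represents each text by the relative frequencies of the most frequent words in a toy corpus and compares texts with a plain Manhattan distance. This is a simplification chosen for illustration: classic stylometric measures such as Burrows's Delta also standardise each word's frequency (z-scores) before measuring distance. The three miniature texts and their labels are invented.

```python
from collections import Counter


def relative_frequencies(tokens, vocabulary):
    """Frequency of each vocabulary word divided by the length of the text."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def manhattan(u, v):
    """Sum of absolute differences between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical miniature corpus: two texts by "author A", one by "author B".
texts = {
    "A1": "the cat and the dog and the bird".split(),
    "A2": "the dog and the cat and the fish".split(),
    "B1": "a cat or a dog or a bird or a fish".split(),
}

# The vocabulary is the set of most frequent words across the whole corpus.
corpus_counts = Counter(token for tokens in texts.values() for token in tokens)
vocabulary = [word for word, _ in corpus_counts.most_common(4)]

profiles = {name: relative_frequencies(tokens, vocabulary) for name, tokens in texts.items()}
d_same = manhattan(profiles["A1"], profiles["A2"])
d_diff = manhattan(profiles["A1"], profiles["B1"])
print(vocabulary, d_same, d_diff)
```

The two "author A" texts end up closer to each other than to the "author B" text because they share the same habits with very frequent words, even though their content words differ.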
R and RStudio
The most important thing to do is to install R and RStudio on your machine. We will use the stylo library for most of the day; if you want, you can look at the step-by-step introduction to stylo for beginners, or at the more extensive HOW TO, but we will cover all the basics. NB: In case you don’t / can’t install R and stylo locally, there will be an option to run the analysis from Colab; just be aware that it will require more coding, not less, because stylo’s interface doesn’t work in Colab. You are free to bring your own collection of texts to the workshop, but you can also find plain-text fiction collections in various languages on the Computational Stylistics website.
Materials
We will use a GDrive folder that holds all necessary materials. Your options are:
- Download it and work locally
- Download -> re-upload the folder to work in Colab’s virtual machine with R (or just open the .ipynb notebook, then File -> Save a copy in Drive; this creates a copy of the notebook on your own Drive).
To copy the folder to your own GDrive, do this:
- Download CLS_Madrid_Folder
- Extract the downloaded .zip
- Upload the extracted folder back to your GDrive
Finding Gold I: Stylometry - Distances and differences
7. Finding Gold II: Stylometry with R
This session introduces the ‘stylo’ library for R, which allows users to perform various stylometric analyses on a collection of documents. We briefly introduce the R language and go over the graphical user interface of ‘stylo’ to show the practicalities of feature selection, distance metrics, cluster analysis and sampling.
Note - Download instructions for software and materials can be found in Session 5
Finding Gold II: Stylometry with R
8. Finding Gold III: Keywords and associations
In this session, we look at the idea of word ‘keyness’ and at ways to understand which features differ between texts and corpora. We also show how to trace features that might underlie text groupings and clusters, and how to detect a potential bias, or systematic error, in a corpus.
Finding Gold III: Keywords and associations
Note - Download instructions for software and materials can be found in Session 5
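One widely used way to quantify keyness is the log-likelihood (G2) statistic, which compares a word's observed frequency in two corpora with the frequency expected if the word were spread evenly across both. The sketch below assumes that measure; the session may rely on other keyness statistics, and the function names and toy corpora are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, size_a, size_b):
    """G2 keyness score for a word seen a times in corpus A and b times in corpus B."""
    expected_a = size_a * (a + b) / (size_a + size_b)
    expected_b = size_b * (a + b) / (size_a + size_b)
    g2 = 0.0
    if a > 0:  # a zero count contributes nothing to the sum
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2 * g2


def keywords(tokens_a, tokens_b, top=3):
    """Words ranked by how strongly their frequency differs between the two corpora."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    scores = {
        word: log_likelihood(counts_a[word], counts_b[word], size_a, size_b)
        for word in set(counts_a) | set(counts_b)
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Hypothetical miniature corpora.
horror = "the ghost saw the ghost in the dark".split()
pastoral = "the gardener saw the rose in the garden".split()
print(keywords(horror, pastoral, top=1))
```

The score is symmetric: it says how strongly a word's frequency differs between the corpora, not in which corpus it is overused, so the raw counts still need to be checked for direction.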
9. Finding Gold IV: Case-study - Stylometry applied to Old Spanish Poetry
In this session, we will try to find out whether the Old Spanish version of the Book of Alexander was written by the author known as Gonzalo de Berceo, as one manuscript states. We will gather the data from the Old Spanish Textual Archive (OSTA), a corpus of lemmatised and PoS-tagged Old Spanish texts, and then use the stylo package to investigate the question. Afterwards, we will see whether rhyming words are a good basis for establishing the authorship of the Book of Alexander.
Finding Gold IV: Case-study - Stylometry applied to Old Spanish Poetry
10. Showing Gold I: Visualisation
This session aims to introduce visualisation grammars and their applications to accelerate exploratory data analysis in digital humanities projects by allowing users to create interactive visualisations with minimal effort in a Jupyter notebook. To this end, a real application will be built by replicating the design process of a visualisation system based on machine-annotated textual data, in a similar setup to many digital humanities research projects.
You can download the course materials for this session here.
Showing Gold I: Visualisation
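A visualisation grammar describes a chart declaratively (data, mark type, encodings) instead of through drawing commands. As an illustration only, the sketch below builds a small bar-chart specification in Vega-Lite, one such grammar, from toy machine-annotated data. That the session uses Vega-Lite is an assumption, and the annotated tokens are invented.

```python
import json
from collections import Counter

# Hypothetical machine-annotated tokens: (word, part-of-speech tag).
annotations = [
    ("ghost", "NOUN"), ("walks", "VERB"), ("night", "NOUN"),
    ("dark", "ADJ"), ("screams", "VERB"), ("castle", "NOUN"),
]

# Aggregate the annotations into one row per tag.
tag_counts = Counter(tag for _, tag in annotations)
rows = [{"tag": tag, "count": count} for tag, count in tag_counts.items()]

# A declarative specification: what to show, not how to draw it.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "description": "Part-of-speech counts in a toy annotated text",
    "data": {"values": rows},
    "mark": "bar",
    "encoding": {
        "x": {"field": "tag", "type": "nominal"},
        "y": {"field": "count", "type": "quantitative"},
        "tooltip": [
            {"field": "tag", "type": "nominal"},
            {"field": "count", "type": "quantitative"},
        ],
    },
}

print(json.dumps(spec, indent=2))
```

The printed JSON can be pasted into any Vega-Lite renderer; changing "mark" or an encoding changes the chart without touching the data preparation.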
11. Storing Gold II: Lindat/Teitok
Corpus data should be kept FAIR: findable, accessible, interoperable and reusable. This lecture will show how to do that using an example set-up in which the LINDAT repository is used for findability (as well as long-term preservation), TEITOK for accessibility (as well as maintenance and annotation), and TEI for interoperability and reusability. We will also show how to interact with these tools and standards both via a GUI (online) and via an API (to interact programmatically).
Storing Gold II: Lindat/Teitok