Corpus Analysis with spaCy

Say you have a big collection of texts. Maybe you’ve gathered speeches from the French Revolution, compiled a bunch of Amazon product reviews, or unearthed a collection of diary entries written during the First World War. In any of these cases, computational analysis can be a good way to complement close reading of your corpus… but where should you start?

One possible way to begin is with spaCy, an industrial-strength library for Natural Language Processing (NLP) in Python. spaCy is capable of processing large corpora, generating linguistic annotations including part-of-speech tags and named entities, as well as preparing texts for further machine classification. This lesson is a ‘spaCy 101’ of sorts, a primer for researchers who are new to spaCy and want to learn how it can be used for corpus analysis. It may also be useful for those who are curious about natural language processing tools in general, and how they can help us to answer humanities research questions.

Reviewed by:

  • Maria Antoniak
  • William Mattingly

Learning outcomes

After completing this lesson, you will be able to:

  • Upload a corpus of texts to a platform for Python analysis (using Google Colaboratory)
  • Use spaCy to enrich the corpus through tokenization, lemmatization, part-of-speech tagging, dependency parsing and chunking, and named entity recognition
  • Conduct frequency analyses using part-of-speech tags and named entities
  • Download an enriched dataset for use in future NLP analyses
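To give a rough sense of what the frequency-analysis outcome above involves, here is a minimal sketch in Python. It is not taken from the lesson itself: the `pos_frequencies` helper, the `SampleToken` stand-in, and the hand-tagged sample sentence are all invented for illustration. The only thing assumed about spaCy is that its tokens expose a `pos_` attribute holding the part-of-speech tag.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy token, so the sketch runs without spaCy
# installed. Real spaCy tokens also expose `text` and `pos_` attributes.
SampleToken = namedtuple("SampleToken", ["text", "pos_"])


def pos_frequencies(tokens):
    """Count how often each part-of-speech tag occurs in an iterable of tokens."""
    return Counter(token.pos_ for token in tokens)


# Hand-tagged sample sentence, invented for illustration only.
sample = [
    SampleToken("The", "DET"),
    SampleToken("citizens", "NOUN"),
    SampleToken("demanded", "VERB"),
    SampleToken("bread", "NOUN"),
    SampleToken("and", "CCONJ"),
    SampleToken("justice", "NOUN"),
]

counts = pos_frequencies(sample)

# Print each tag with its count, most frequent first.
for tag, count in counts.most_common():
    print(tag, count)
```

With spaCy installed and a pipeline loaded, the same helper should accept a processed document directly, since iterating over a spaCy `Doc` yields tokens with a `pos_` attribute. Named entities could be counted in the same way from `doc.ents`, using each entity's `label_`.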

Interested in learning more?

Check out this lesson on Programming Historian's website

Cite as

Megan S. Kane (2023). Corpus Analysis with spaCy. Version 1.0.0. Edited by John R Ladd. ProgHist Ltd. [Training module]. https://doi.org/10.46430/phen0113

Reuse conditions

Resources hosted on DARIAH-Campus are subject to the DARIAH-Campus Training Materials Reuse Charter

Full metadata

Title:
Corpus Analysis with spaCy
Authors:
Megan S. Kane
Domain:
Social Sciences and Humanities
Language:
en
Published to DARIAH-Campus:
11/18/2024
Originally published:
11/2/2023
Content type:
Training module
Licence:
CC BY 4.0
Sources:
Programming Historian
Topics:
Python, Big data
Version:
1.0.0