Analyzing Multilingual French and Russian Text using NLTK, spaCy, and Stanza

Many resources for learning computational text analysis focus on English-language texts and corpora, and often lack the information needed to work with non-English source material. To help remedy this, this lesson introduces the analysis of non-English and multilingual text (that is, text written in more than one language) using Python. Using a multilingual text composed of Russian and French, it shows how to use computational methods to perform three fundamental preprocessing tasks: tokenization, part-of-speech tagging, and lemmatization. It then teaches you to automatically detect the languages present in a preprocessed text.

To perform the three fundamental preprocessing steps, this lesson uses three common Python packages for Natural Language Processing (NLP): the Natural Language Toolkit (NLTK), spaCy, and Stanza. We’ll start by going over these packages, reviewing and comparing their core features, so you can understand how they work and discern which tool is right for your specific use case and coding style.
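As a rough illustration of two of the ideas named above, tokenization and detecting which language a token belongs to, here is a minimal sketch that uses only the Python standard library. It is not the lesson's method, which relies on NLTK, spaCy, and Stanza. The sample sentence and the function names `tokenize` and `guess_script` are hypothetical, and checking a token's Unicode script is only a crude stand-in for real language detection: it can separate Cyrillic from Latin text, but it cannot tell Russian from Ukrainian or French from English.

```python
import re
import unicodedata

# Hypothetical sample sentence mixing Russian and French (not taken from the lesson).
SAMPLE = "Она сказала: «Bonjour, mon ami!» и улыбнулась."


def tokenize(text):
    """Split text into word tokens and individual punctuation marks."""
    # \w+ matches runs of letters/digits in any script;
    # [^\w\s] matches a single non-word, non-space character.
    return re.findall(r"\w+|[^\w\s]", text)


def guess_script(token):
    """Label a token by the Unicode script of its letters (assumed proxy for language)."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("CYRILLIC"):
                scripts.add("cyrillic")
            elif name.startswith("LATIN"):
                scripts.add("latin")
            else:
                scripts.add("other")
    if len(scripts) == 1:
        return scripts.pop()
    # No letters at all (punctuation, digits) versus letters from several scripts.
    return "mixed" if scripts else "none"


tokens = tokenize(SAMPLE)
labels = [(tok, guess_script(tok)) for tok in tokens]

for tok, script in labels:
    print(f"{tok}\t{script}")
```

Part-of-speech tagging and lemmatization need trained language models, which is where packages such as those named above come in; a sketch like this only covers the steps that can be approximated with pattern matching.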

Reviewed by:

  • William Mattingly
  • Merve Tekgürler

Learning outcomes

After completing this lesson, you will be able to:

  • Apply strategies for analyzing non-English, multilingual text
  • Use computational methods to perform three fundamental preprocessing tasks: tokenization, part-of-speech tagging, and lemmatization
  • Automatically detect the languages present in a preprocessed text
Interested in learning more?

Check out this lesson on Programming Historian's website

Cite as

Ian Goodale (2024). Analyzing Multilingual French and Russian Text using NLTK, spaCy, and Stanza. Version 1.0.0. Edited by Laura Alice Chapot. ProgHist Ltd. [Training module]. https://doi.org/10.46430/phen0121

Reuse conditions

Resources hosted on DARIAH-Campus are subject to the DARIAH-Campus Training Materials Reuse Charter

Full metadata

Title: Analyzing Multilingual French and Russian Text using NLTK, spaCy, and Stanza
Authors: Ian Goodale
Domain: Social Sciences and Humanities
Language: en
Published to DARIAH-Campus: 1/30/2025
Originally published: 11/13/2024
Content type: Training module
Licence: CC BY 4.0
Sources: Programming Historian
Topics: Python, Data management
Version: 1.0.0