Analyzing Multilingual French and Russian Text using NLTK, spaCy, and Stanza
Many of the resources available for learning computational methods of text analysis focus on English-language texts and corpora, and often lack the information needed to work with non-English source material. To help remedy this, this lesson provides an introduction to analyzing non-English and multilingual text (that is, text written in more than one language) using Python. Using a multilingual text composed of Russian and French, this lesson shows how you can use computational methods to perform three fundamental preprocessing tasks: tokenization, part-of-speech tagging, and lemmatization. Then, it teaches you to automatically detect the languages present in a preprocessed text.
To perform the three fundamental preprocessing steps, this lesson uses three common Python packages for Natural Language Processing (NLP): the Natural Language Toolkit (NLTK), spaCy, and Stanza. We’ll start by going over these packages, reviewing and comparing their core features, so you can understand how they work and discern which tool is right for your specific use case and coding style.
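To give a sense of what the first preprocessing step, tokenization, involves, here is a minimal pure-Python sketch. It uses a regular expression rather than NLTK, spaCy, or Stanza, so it ignores the linguistic subtleties those libraries handle (contractions, punctuation rules, language-specific exceptions), but it illustrates the basic idea of splitting running text into word tokens across scripts:

```python
import re

def tokenize(text):
    # \w+ matches runs of word characters; in Python 3 this covers
    # Cyrillic and accented Latin letters, so it works on mixed
    # Russian/French text. This is a toy tokenizer, not a substitute
    # for NLTK, spaCy, or Stanza.
    return re.findall(r"\w+", text)

sample = "Фёдор говорил: «c'est la vie» — и улыбался."
tokens = tokenize(sample)
```

Note that a naive pattern like this splits "c'est" into two tokens ("c", "est"); handling such cases correctly is exactly why dedicated NLP libraries are worth using.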
Reviewed by:
- William Mattingly
- Merve Tekgürler
Learning outcomes
After completing this lesson, you will be able to:
- Apply strategies for analyzing non-English and multilingual text
- Use computational methods to perform three fundamental preprocessing tasks: tokenization, part-of-speech tagging, and lemmatization
- Automatically detect the languages present in a preprocessed text
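As a taste of the language-detection outcome above, here is a hedged, pure-Python sketch for the specific Russian/French case: since Russian is written in Cyrillic and French in Latin script, a simple Unicode-range heuristic can label each token. Real language identification (as covered in the lesson's tooling) is considerably more robust, and this heuristic would fail for languages sharing a script:

```python
def guess_language(token):
    # Heuristic for this lesson's Russian/French case only:
    # any Cyrillic codepoint (U+0400 to U+04FF) -> Russian,
    # otherwise assume French. Not a general language detector.
    if any("\u0400" <= ch <= "\u04FF" for ch in token):
        return "ru"
    return "fr"

tokens = ["Фёдор", "говорил", "est", "la", "vie"]
labels = [guess_language(t) for t in tokens]
```

Running this labels the first two tokens `"ru"` and the rest `"fr"`, which is enough to separate the two languages in a preprocessed token list.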
Check out this lesson on Programming Historian's website