Analyzing Multilingual French and Russian Text using NLTK, spaCy, and Stanza
Many of the resources available for learning computational methods of text analysis focus on English-language texts and corpora, and often lack the information needed to work with non-English source material. To help remedy this, this lesson provides an introduction to analyzing non-English and multilingual text (that is, text written in more than one language) using Python. Using a multilingual text composed of Russian and French, this lesson shows how you can use computational methods to perform three fundamental preprocessing tasks: tokenization, part-of-speech tagging, and lemmatization. Then, it teaches you to automatically detect the languages present in a preprocessed text.
To perform the three fundamental preprocessing steps, this lesson uses three common Python packages for Natural Language Processing (NLP): the Natural Language Toolkit (NLTK), spaCy, and Stanza. We’ll start by going over these packages, reviewing and comparing their core features, so you can understand how they work and discern which tool is right for your specific use case and coding style.
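To give a sense of what the first preprocessing step, tokenization, involves, here is a minimal pure-Python sketch. It uses a regular expression rather than NLTK, spaCy, or Stanza, so it ignores the linguistic subtleties those libraries handle (contractions, punctuation rules, language-specific exceptions), but it illustrates the basic idea of splitting running text into word tokens across scripts:

```python
import re

def tokenize(text):
    # \w+ matches runs of word characters; in Python 3 this covers
    # Cyrillic and accented Latin letters, so it works on mixed
    # Russian/French text. This is a toy tokenizer, not a substitute
    # for NLTK, spaCy, or Stanza.
    return re.findall(r"\w+", text)

sample = "Фёдор говорил: «c'est la vie» — и улыбался."
tokens = tokenize(sample)
```

Note that a naive pattern like this splits "c'est" into two tokens ("c", "est"); handling such cases correctly is exactly why dedicated NLP libraries are worth using.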
Reviewed by:
- William Mattingly
- Merve Tekgürler
Learning outcomes
After completing this lesson, you will be able to:
- Apply strategies for analyzing non-English and multilingual text
- Use computational methods to perform three fundamental preprocessing tasks: tokenization, part-of-speech tagging, and lemmatization
- Automatically detect the languages present in a preprocessed text
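As a taste of the language-detection outcome above, here is a hedged, pure-Python sketch for the specific Russian/French case: since Russian is written in Cyrillic and French in Latin script, a simple Unicode-range heuristic can label each token. Real language identification (as covered in the lesson's tooling) is considerably more robust, and this heuristic would fail for languages sharing a script:

```python
def guess_language(token):
    # Heuristic for this lesson's Russian/French case only:
    # any Cyrillic codepoint (U+0400 to U+04FF) -> Russian,
    # otherwise assume French. Not a general language detector.
    if any("\u0400" <= ch <= "\u04FF" for ch in token):
        return "ru"
    return "fr"

tokens = ["Фёдор", "говорил", "est", "la", "vie"]
labels = [guess_language(t) for t in tokens]
```

Running this labels the first two tokens `"ru"` and the rest `"fr"`, which is enough to separate the two languages in a preprocessed token list.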
Check out this lesson on Programming Historian's website