Researchers often need to search a corpus of texts for a defined list of terms. Historians, for example, are frequently interested in the places named in a text or set of texts. This lesson details how to programmatically search documents for a list of terms, including place names. To begin, we produce a tab-separated value (TSV) file in which each row gives the matched term and its location in the text. We then generate a visualisation that can be used to interpret the matches in context and to assess their usefulness for a given project. The goal of the lesson is to systematically search a text corpus for place names and then to use a gazetteer service to locate and map historic place names.
This lesson will be useful for anyone wishing to perform named entity recognition (NER) on a text corpus. Other users may wish to skip the text extraction portion and focus solely on the spatial elements of the lesson, that is, gazetteer building and using the World Historical Gazetteer (WHG). These spatial steps are especially useful for anyone looking to create maps depicting historical information in a largely point-and-click interface. We have designed this lesson to show how to combine text analysis with mapping, but we understand that some readers may be interested in only one of these two methodologies. If you have time, we urge you to try both parts of the lesson together, as this will show how text analysis and mapping can be combined in one project. It will also demonstrate how the results of these two activities can be ported into another form of digital analysis.
After completing this lesson, you will be able to:
- Programmatically search documents for a list of terms, including place names
- Produce a tab-separated value (TSV) file where each row gives the matched term and the term’s location in the text
- Generate a visualisation that can be used to interpret the matches in context
- Combine text analysis with mapping
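As a rough illustration of the first two outcomes, the sketch below searches a short text for a list of place names and writes each match and its character offset as a TSV row. It is a minimal example under stated assumptions, not the lesson's own code: the function names `find_terms` and `matches_to_tsv`, the sample sentence, the term list, and the column headers are all hypothetical, and "location" is taken here to mean the starting character offset of the match.

```python
import csv
import io
import re


def find_terms(text, terms):
    """Return (term, start_offset) pairs for each whole-word match, in text order."""
    # Try longer terms first so a multi-word name wins over a shorter name inside it.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b")
    return [(m.group(0), m.start()) for m in pattern.finditer(text)]


def matches_to_tsv(matches):
    """Serialise matches as tab-separated rows with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["term", "start"])  # hypothetical column names
    writer.writerows(matches)
    return buffer.getvalue()


# Hypothetical sample data, used only to show the shape of the output.
sample_text = "The train left Moscow for Kolyma, then returned to Moscow."
sample_terms = ["Moscow", "Kolyma"]

matches = find_terms(sample_text, sample_terms)
tsv_output = matches_to_tsv(matches)
print(tsv_output)
```

Matching on whole words with `\b` avoids counting a term that only appears inside a longer word. The search is case-sensitive, which suits capitalised place names but would miss variant spellings.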
Check out this lesson on Programming Historian's website.