OCR with Google Vision API and Tesseract

Authors

Isabelle Gribomont

Topics

Python

Historians working with digital methods and text-based material often need to convert PDF files to plain text. Whether you are interested in network analysis, named entity recognition, corpus linguistics, text reuse, or any other type of text-based analysis, good-quality Optical Character Recognition (OCR), which transforms a PDF into a computer-readable file, will be the first step. However, OCR becomes trickier when dealing with historical fonts and characters, damaged manuscripts or low-quality scans. Fortunately, tools such as Tesseract, TRANSKRIBUS, OCR4all, eScriptorium and OCR-D (among others) have allowed humanities scholars to work with all kinds of documents, from handwritten nineteenth-century letters to medieval manuscripts.

Despite these great tools, it can still be difficult to find an OCR solution that matches our level of technical knowledge, can be easily integrated into a workflow, or can be applied to a multilingual or otherwise diverse corpus without requiring extra input from the user. This lesson offers a possible alternative by introducing two ways of combining Google Vision’s character recognition with Tesseract’s layout detection. Google Cloud Vision is one of the best ‘out-of-the-box’ tools when it comes to recognising individual characters but, unlike Tesseract, it has poor layout recognition capabilities. Combining both tools creates a ‘one-size-fits-most’ method that will generate high-quality OCR outputs for a wide range of documents.
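To make the idea of combining the two tools more concrete, here is a minimal sketch in plain Python. It is not the lesson's implementation and calls neither Tesseract nor Google Vision: it assumes that a layout tool has already returned text regions as bounding boxes in reading order, and that a character-recognition tool has returned words with their own bounding boxes. The names `Box`, `Word` and `merge_layout_and_words` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A rectangle in page coordinates (hypothetical structure)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x, y):
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass
class Word:
    """A recognised word and its bounding box (hypothetical structure)."""
    text: str
    box: Box

    @property
    def centre(self):
        return ((self.box.left + self.box.right) / 2,
                (self.box.top + self.box.bottom) / 2)


def merge_layout_and_words(regions, words):
    """Group recognised words by the layout region that contains them.

    Assumption: `regions` is already in reading order, and `words` is in
    the order the recogniser returned them. Words keep that order within
    each region. Words whose centre falls in no region are returned
    separately rather than silently discarded.
    """
    grouped = [[] for _ in regions]
    unassigned = []
    for word in words:
        x, y = word.centre
        for index, region in enumerate(regions):
            if region.contains(x, y):
                grouped[index].append(word)
                break
        else:
            unassigned.append(word)
    paragraphs = [" ".join(w.text for w in bucket) for bucket in grouped]
    return paragraphs, unassigned


# Illustrative two-column page: the regions stand in for layout detection,
# the words stand in for character recognition that read straight across.
regions = [Box(0, 0, 100, 200), Box(120, 0, 220, 200)]
words = [
    Word("Left", Box(10, 10, 40, 20)),
    Word("Right", Box(130, 10, 170, 20)),
    Word("column", Box(45, 10, 90, 20)),
    Word("column", Box(175, 10, 215, 20)),
    Word("7", Box(105, 210, 115, 220)),  # e.g. a page number outside both regions
]
paragraphs, unassigned = merge_layout_and_words(regions, words)
print(paragraphs)
```

In this toy example the words arrive interleaved across the two columns, and the region boxes are what restore a sensible reading order. A real workflow would also need to decide what to do with the unassigned words and with words that straddle two regions.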

The principle of exploring different combinations of tools to create customised workflows is widely applicable in digital humanities projects, where tools tailored to our data are not always available.

Reviewed by:

Ryan Cordell
Clemens Neudecker

Learning outcomes

After completing this lesson, you will be able to:

Combine Google Vision’s character recognition with Tesseract’s layout detection to generate high-quality OCR outputs for a wide range of documents
Accurately convert PDF files into plain text
Understand a variety of considerations to keep in mind when converting a PDF to plain text
