OCR with Google Vision API and Tesseract

Historians working with digital methods and text-based material are often confronted with PDF files that need to be converted to plain text. Whether you are interested in network analysis, named entity recognition, corpus linguistics, text reuse, or any other type of text-based analysis, good quality Optical Character Recognition (OCR), which converts images of text, such as the pages of a scanned PDF, into machine-readable text, will be the first step. However, OCR becomes trickier when dealing with historical fonts and characters, damaged manuscripts or low-quality scans. Fortunately, tools such as Tesseract, TRANSKRIBUS, OCR4all, eScriptorium and OCR-D (among others) have allowed humanities scholars to work with all kinds of documents, from handwritten nineteenth-century letters to medieval manuscripts.

Despite these great tools, it can still be difficult to find an OCR solution that matches our level of technical knowledge, can be easily integrated within a workflow, or can be applied to a multilingual/diverse corpus without requiring any extra input from the user. This lesson offers a possible alternative by introducing two ways of combining Google Vision’s character recognition with Tesseract’s layout detection. Google Cloud Vision is one of the best ‘out-of-the-box’ tools when it comes to recognising individual characters but, unlike Tesseract, it has poor layout recognition capabilities. Combining both tools creates a “one-size-fits-most” method that will generate high-quality OCR outputs for a wide range of documents.
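To make the idea of pairing one tool's layout detection with another tool's character recognition more concrete, here is a minimal sketch in plain Python. It is not the lesson's own code and calls neither Google Vision nor Tesseract: the `Box` and `Word` classes, the `assemble_text` function and the sample coordinates are all hypothetical, standing in for text regions that a layout detector might return and for words with bounding boxes that a character recogniser might return.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical structure)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass
class Word:
    """A recognised word plus its bounding box (hypothetical structure)."""
    text: str
    box: Box

    @property
    def centre(self) -> tuple[float, float]:
        return ((self.box.left + self.box.right) / 2,
                (self.box.top + self.box.bottom) / 2)


def assemble_text(regions: list[Box], words: list[Word]) -> str:
    """Group words by layout region, keeping the regions' reading order.

    Assumption: words on the same line share the same `top` value, so
    sorting by (top, left) gives a rough reading order inside a region.
    """
    paragraphs = []
    for region in regions:
        inside = [w for w in words if region.contains(*w.centre)]
        inside.sort(key=lambda w: (w.box.top, w.box.left))
        if inside:
            paragraphs.append(" ".join(w.text for w in inside))
    return "\n\n".join(paragraphs)


# Invented two-column page: regions as a layout detector might order them.
regions = [Box(0, 0, 100, 200), Box(110, 0, 210, 200)]

# Words in the row-by-row order a layout-unaware reader might produce.
words = [
    Word("The", Box(10, 10, 30, 20)),
    Word("Second", Box(120, 10, 160, 20)),
    Word("cat", Box(40, 10, 60, 20)),
    Word("column", Box(170, 10, 200, 20)),
    Word("sat", Box(10, 30, 30, 40)),
]

print(assemble_text(regions, words))
```

Read row by row, the sample words would interleave the two columns; grouping them by region first keeps each column's text together. Real pages would need more robust line grouping than the shared-`top` assumption used here.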

The principle of exploring different combinations of tools to create customised workflows is widely applicable in digital humanities projects, where tools tailored to our data are not always available.

Reviewed by:

  • Ryan Cordell
  • Clemens Neudecker

Learning outcomes

After completing this lesson, you will be able to:

  • Combine Google Vision’s character recognition with Tesseract’s layout detection to generate high-quality OCR outputs for a wide range of documents
  • Accurately convert PDF files into plain text
  • Understand a variety of considerations to keep in mind when converting a PDF to plain text

Interested in learning more?

Check out this lesson on Programming Historian's website

Cite as

Isabelle Gribomont (2023). OCR with Google Vision API and Tesseract. Version 1.0.0. Edited by Liz Fischer. ProgHist Ltd. [Training module]. https://doi.org/10.46430/phen0109

Reuse conditions

Resources hosted on DARIAH-Campus are subject to the DARIAH-Campus Training Materials Reuse Charter.

Full metadata

Title:
OCR with Google Vision API and Tesseract
Authors:
Isabelle Gribomont
Domain:
Social Sciences and Humanities
Language:
en
Published to DARIAH-Campus:
9/18/2024
Originally published:
3/31/2023
Content type:
Training module
Licence:
CC BY 4.0
Sources:
DARIAH
Topics:
Python
Version:
1.0.0