Python scan to text

Extracting Text From Images Using PyTesseract

As a developer, you might want to extract textual information from an image. Using Python, we can create a program that extracts such textual data from any given image.

Python has been one of the most popular languages developers enjoy working with. Its human-readable syntax makes it easy to learn.

In this guide, we will write a Python script that extracts images from a PDF file, scans them for text, transcribes it, and saves it to a text file. We will use the Python tesseract library (pytesseract) to recognize textual data in the images.

Prerequisites

To follow along with this article, ensure that you have Python installed and running on your computer.

Also, ensure you have some basic understanding of Python.

Setting up tesseract OCR

Optical Character Recognition (OCR) is a technology used to recognize text in images. It can convert handwritten or printed text into machine-readable text.

To use OCR, you need to install and configure tesseract on your computer.

First, download the Tesseract OCR installer. While installing it, make sure you copy the tesseract installation path and add it to your system environment variables.

Once the process is done, run the tesseract -v command to verify that the OCR is installed.
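
The same check can be scripted. The sketch below is only an illustration and assumes tesseract is meant to be reachable through the PATH variable; the function name tesseract_version is made up for this example.

```python
import shutil
import subprocess


def tesseract_version(command="tesseract"):
    """Return the first line of the version banner, or None if the command is not on PATH."""
    executable = shutil.which(command)
    if executable is None:
        return None
    # Same check as running `tesseract -v` by hand
    result = subprocess.run([executable, "-v"], capture_output=True, text=True)
    # Depending on the Tesseract version, the banner is written to stdout or stderr
    banner = (result.stdout or result.stderr).strip()
    return banner.splitlines()[0] if banner else ""
```

Calling tesseract_version() returns None when the executable cannot be found, which usually means the installation path was not added to the environment variables.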

To test whether this environment is working, you may run OCR on any image and see if the textual data gets extracted and saved in a readable text file.

To do that, ensure you have an image with textual information. Use your command line to navigate to the image location and run the following tesseract command:

tesseract <image_name> <file_name>

Replace <image_name> with the image file and <file_name> with the name of the output file, without an extension. When the command is executed, a .txt file with that name is created and saved in the same folder.

This confirms that the tesseract library is successfully installed. We may now proceed to implement the same using a Python script.

Adding the project dependencies

We need to install a few dependent libraries to help us get started with the Python script.

Pytesseract

Python-tesseract (pytesseract) is an OCR library that scans and transcribes textual data in images. It returns the recognized text to your script but does not save it to a text document.

To install pytesseract, run the following command:

pip install pytesseract
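
Because pytesseract only returns the recognized text, saving it is up to the script. Here is a minimal sketch, assuming pytesseract and Pillow are installed and the Tesseract executable is reachable. The names ocr_to_file and ocr are hypothetical and used only for illustration.

```python
from pathlib import Path


def ocr_to_file(image_path, text_path, ocr=None):
    """Recognize the text in image_path and write it to text_path."""
    if ocr is None:
        # Imported lazily so the function can be tried with a stand-in OCR callable
        import pytesseract
        from PIL import Image

        def ocr(path):
            return pytesseract.image_to_string(Image.open(path))

    text = ocr(image_path)
    Path(text_path).write_text(text, encoding="utf-8")
    return text
```

Passing a custom ocr callable makes it possible to try the file-writing part without Tesseract installed.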

PyMuPDF

PyMuPDF is a Python library that is used to access documents such as PDFs, along with the images they contain.

In this application, PyMuPDF will read PDF documents and check for any embedded images. The images are extracted and saved to disk so that they can be scanned for text.

To install PyMuPDF, run the following command:

pip install PyMuPDF

Pillow

The Pillow library opens, processes, and saves images in many file formats.

To install Pillow, run the following command:

pip install Pillow

Opencv-python

Opencv-python is used to read images and videos, apply image transformations, draw shapes, and put text on those files.

We will use OpenCV to read the extracted images and display them while pytesseract recognizes the text.

To install opencv-python, run the following command:

pip install opencv-python

Create a Python tesseract script

Create a project folder and add a new main.py file inside that folder.

First, we need to import these library dependencies that we installed. Add the following imports inside the main.py file:

import os  # standard library
import io  # standard library
import platform  # standard library

import fitz  # from PyMuPDF
import pytesseract  # from pytesseract
import cv2  # from opencv-python
from PIL import Image, ImageFile  # from Pillow
from colorama import Fore  # third-party: pip install colorama

Then, allow Pillow to load image files even if they are truncated:

ImageFile.LOAD_TRUNCATED_IMAGES = True 

Once the application has access to a PDF file, its content is extracted in the form of images. These images are then processed to extract the text.

In this case, we need to create a few global variables that help to create and save these images to the project path. We also specify the path to save the extracted text into a .txt file.

Go ahead and add these global variables as shown:

# Global variables
strPDF, textScanned, inputTeEx = "", "", ""
dirName = ["images", "output_txt"]

These variables name two directories: images, where the images extracted from the PDF will be saved, and output_txt, where the scanned text will be saved as a .txt file. The directories themselves are created later, inside extIm().

Now, let's create the function that asks for the location of the installed tesseract library and of the required file. We will do this in the gInUs() function as shown:

def gInUs():
    # Global variables
    global strPDF
    global inputTeEx
    if platform.system() == "Windows":
        # Ask for the tesseract.exe path
        print(Fore.YELLOW + "[.] Add the tesseract.exe local path" + Fore.RESET)
        inputTeEx = input()
    # Ask for the PDF path
    print(Fore.GREEN + "[!] Add the PDF file local path:" + Fore.RESET)
    inputUser = input()
    # Print an alert if the input is not valid; otherwise call extIm
    if inputUser == "" or len(inputUser.split("\\")) == 1:
        print(Fore.RED + "[X] Please enter a valid PATH to a file" + Fore.RESET)
    else:
        extIm(inputUser)

The function prints two prompts:

  • "[.] Add the tesseract.exe local path" asks for the location of the tesseract executable (Windows only).
  • "[!] Add the PDF file local path:" asks for the local PDF file we want to use.

Once we enter this path, the script first verifies whether the file path is valid. If it is not, the application displays the "Please enter a valid PATH to a file" error message. If the path is valid, the application extracts the images by executing the extIm() function.
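
The check in gInUs() only looks for a backslash in the input, so it accepts paths to files that do not exist and rejects paths written with forward slashes. A stricter variant might look like the sketch below. is_valid_pdf_path is a hypothetical helper, not part of the original script.

```python
from pathlib import Path


def is_valid_pdf_path(user_input):
    """Return True if user_input points to an existing file with a .pdf extension."""
    cleaned = user_input.strip().strip('"')  # tolerate surrounding spaces and pasted quotes
    if not cleaned:
        return False
    path = Path(cleaned)
    return path.is_file() and path.suffix.lower() == ".pdf"
```

With such a helper, gInUs() could call extIm(inputUser) only when is_valid_pdf_path(inputUser) is true, on any operating system.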

Extract images

Once we have the correct PDF file path, we need to open the file and extract its images, so that their text can later be written to the .txt file.

First, we need to open the PDF file and read its contents. To do that, we will use the fitz module as shown below:

# Extracting images
def extIm(fileStr):
    # open the file
    pdf_file = fitz.open(fileStr)

We create a path to save the images that we extract from the file:

    global dirName
    # Create the output folders if they don't exist
    for i in dirName:
        try:
            os.makedirs(i)
            print(Fore.GREEN + "[!] Directory ", i, " Created" + Fore.RESET)
        except FileExistsError:
            print(Fore.RED + "[X] Directory ", i, " already exists" + Fore.RESET)
    content = os.listdir("images")
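
When the "already exists" message is not needed, the try/except can be shortened. As a sketch, os.makedirs accepts exist_ok=True, which makes the call succeed when the directory is already there. ensure_dirs and its base parameter are names chosen for this example.

```python
import os


def ensure_dirs(names, base="."):
    """Create each directory under base if it is missing and return the full paths."""
    paths = []
    for name in names:
        path = os.path.join(base, name)
        os.makedirs(path, exist_ok=True)  # no error if the directory already exists
        paths.append(path)
    return paths
```

Calling ensure_dirs(["images", "output_txt"]) twice in a row is safe; the second call simply finds both folders in place.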

We need to check whether the images folder already contains any files. If so, we list them and print each file name; otherwise, we iterate over the PDF pages and collect the images on each page:

    # List the images if any exist and print each one; otherwise extract all images
    if len(content) >= 1:
        # Print every image in content
        for i in content:
            print(Fore.YELLOW + f"This is an image: {i}" + Fore.RESET)
    else:
        # Iterate over the PDF pages
        for page_index in range(len(pdf_file)):
            # get the page itself
            page = pdf_file[page_index]
            image_list = page.getImageList()

If no images are available in the folder, we iterate over the PDF pages and extract the images they contain.

Let's print the number of images found on each page, and display a message if a page contains no images:

            # print the number of images found on this page
            if image_list:
                print(Fore.GREEN + f"[+] Found a total of {len(image_list)} images in page {page_index}" + Fore.RESET)
            else:
                print(Fore.RED + "[!] No images found on page", page_index, Fore.RESET)

In the loop, we name every image that is extracted from the PDF. Here, we append the page number and the image index to the string image. For example, image2_1 is the first image on page 2:

            for image_index, img in enumerate(page.getImageList(), start=1):
                # get the XREF of the image
                xref = img[0]
                # extract the image bytes
                base_image = pdf_file.extractImage(xref)
                image_bytes = base_image["image"]
                # get the image extension
                image_ext = base_image["ext"]
                # load it to PIL
                image = Image.open(io.BytesIO(image_bytes))
                # save it to local disk
                image.save(f"images/image{page_index+1}_{image_index}.{image_ext}")
    reImg()

Here, we execute the function reImg() to render these images and extract their content. Let’s do this in the next step.
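
Recent PyMuPDF releases spell these calls in snake_case (get_images and extract_image); the camelCase names used above are older aliases that may not be available in a current installation. The sketch below shows the same extraction loop with the newer names, assuming PyMuPDF is installed. extract_images and image_filename are illustrative names, and fitz is imported inside the function so that the naming helper can be used on its own.

```python
import os


def image_filename(page_index, image_index, ext):
    """Build names such as image2_1.png (page 2, first image on that page)."""
    return f"image{page_index + 1}_{image_index}.{ext}"


def extract_images(pdf_path, out_dir="images"):
    """Save every embedded image of the PDF into out_dir and return the saved paths."""
    import fitz  # PyMuPDF

    os.makedirs(out_dir, exist_ok=True)
    saved = []
    with fitz.open(pdf_path) as pdf_file:
        for page_index in range(len(pdf_file)):
            page = pdf_file[page_index]
            for image_index, img in enumerate(page.get_images(full=True), start=1):
                xref = img[0]  # cross-reference number of the image object
                base_image = pdf_file.extract_image(xref)
                name = image_filename(page_index, image_index, base_image["ext"])
                path = os.path.join(out_dir, name)
                with open(path, "wb") as handle:
                    handle.write(base_image["image"])  # raw bytes in their original format
                saved.append(path)
    return saved
```

Writing the raw bytes directly skips the round trip through Pillow, since extract_image already returns the image in its stored format.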

Extract text information

Let's create a function named reImg() and declare the global variables it uses:

def reImg():
    # Global variables
    global textScanned
    global dirName
    global inputTeEx

At this point, we have to point pytesseract to the tesseract.exe file. To do that, we use the global variable inputTeEx, which holds the file path entered by the user:

    # inputTeEx is only filled in on Windows; elsewhere tesseract is expected on the PATH
    if inputTeEx:
        pytesseract.pytesseract.tesseract_cmd = inputTeEx

The pytesseract module runs the tesseract executable at this path as a command-line program.

We need to loop through each extracted image and read its content to extract textual information as shown:

    # List the images
    content = os.listdir('images')
    for i in range(len(content)):
        # Read each image in images
        image = cv2.imread(f'images/{content[i]}')
        # Scan text from the image
        print(Fore.YELLOW + f"[.] Scan text from {content[i]}" + Fore.RESET)
        # lang='spa' selects Spanish; use lang='eng' for English text
        text = pytesseract.image_to_string(image, lang='spa')
        # Concatenate the scanned text into one string
        textScanned += text
        print(Fore.GREEN + "[!] Finished scan text" + Fore.RESET)
        # Show the input image
        cv2.imshow('Image', image)
        # Wait 1000 milliseconds (1 second)
        cv2.waitKey(1000)

    # Create and write the file txtResult.txt
    print(Fore.CYAN + "[.] Writing txtResult.txt" + Fore.RESET)
    with open(f"{dirName[1]}/txtResult.txt", "w") as fileTxt:
        fileTxt.write(textScanned)
    print(Fore.GREEN + "[!] File written" + Fore.RESET)
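
One detail worth noting: os.listdir() returns names in arbitrary order, and even a plain alphabetical sort puts image10_1.png before image2_1.png, so the concatenated text may not follow the page order. A possible fix, sketched here with a hypothetical helper named page_order, is to sort by the numbers embedded in the file name:

```python
import re


def page_order(filename):
    """Sort key that orders image2_1.png before image10_1.png."""
    return [int(number) for number in re.findall(r"\d+", filename)]


names = ["image10_1.png", "image2_2.png", "image2_1.jpeg", "image1_1.png"]
print(sorted(names, key=page_order))
```

In reImg(), this would mean iterating over sorted(os.listdir('images'), key=page_order) instead of the raw listing.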

Finally, call the gInUs() function to execute the program:

# Call to main function
gInUs()

Test the app

To test the app, run python main.py.

First, provide the tesseract path and hit enter:

> [.] Add the tesseract.exe local path

Once you hit enter, you will be asked to add the PDF path:

> [!] Add the PDF file local path:

On execution, the program creates an images folder for the extracted images and an output_txt folder, where the extracted text is saved as txtResult.txt.

Conclusion

In this guide, we created a Python script that extracts textual information from the images by scanning, transcribing, and saving it to a text file. You can get the code used in this guide on GitHub.

I hope you found this tutorial helpful.

Peer Review Contributions by: Srishilesh P S

Extracting Text from Scanned PDF using Pytesseract & Open CV

Document Intelligence using Python and other open source libraries

Extracting information from a digital copy of an invoice can be a tricky task. Various tools are available on the market for it. However, for many reasons, people often prefer to solve this problem using open source libraries.

I came across a similar problem a few days ago and want to share the approach I used to solve it. The libraries I used for this solution were pdf2image (for converting PDF to images), OpenCV (for image pre-processing), and PyTesseract for OCR, all with Python.

Converting PDF to Image

pdf2image is a Python library that converts a PDF into a sequence of PIL Image objects using pdftoppm, a command-line tool that ships with Poppler. The following command installs pdf2image with pip:

pip install pdf2image

Note: pdf2image uses Poppler which is a PDF rendering library based on the xpdf-3.0 code base and will not work without it. Please refer to the below resources for downloading and installation instructions for Poppler.

After installation, any pdf can be converted to images using the below code.
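
The original snippet is not reproduced here. As a stand-in, the following is a minimal sketch that assumes pdf2image and Poppler are installed. pdf_to_images, page_filename, and the output folder name are hypothetical, while convert_from_path and its dpi argument come from pdf2image.

```python
import os


def page_filename(page_number, ext="png"):
    """Zero-padded name so the files sort in page order, e.g. page_003.png."""
    return f"page_{page_number:03d}.{ext}"


def pdf_to_images(pdf_path, out_dir="pages", dpi=300):
    """Render each PDF page to a PNG file and return the list of saved paths."""
    from pdf2image import convert_from_path  # needs Poppler on the system

    os.makedirs(out_dir, exist_ok=True)
    saved = []
    # convert_from_path returns one PIL Image per page
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        path = os.path.join(out_dir, page_filename(number))
        page.save(path, "PNG")
        saved.append(path)
    return saved
```

Rendering at 300 DPI is a choice made for this sketch, since higher resolution generally helps OCR at the cost of larger files.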

After converting the PDF to images, the next step is to highlight the regions of the images from which we have to extract the information.

Note: Before marking regions, make sure that you have preprocessed the image to improve its quality (DPI ≥ 300; skew, sharpness, and brightness adjusted; thresholding applied; etc.).
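
Thresholding is the step in this list that is easiest to show in isolation. In practice OpenCV does it with cv2.threshold and the cv2.THRESH_OTSU flag. The dependency-free sketch below only illustrates what Otsu's method computes, on a grayscale image given as rows of integers from 0 to 255. The function names are illustrative.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best separates dark pixels from light ones."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_level, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance: large when the two groups are well separated
        between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between > best_variance:
            best_variance, best_level = between, level
    return best_level


def binarize(rows):
    """Turn a grayscale image (list of rows) into pure black (0) and white (255)."""
    threshold = otsu_threshold([pixel for row in rows for pixel in row])
    return [[255 if pixel > threshold else 0 for pixel in row] for row in rows]
```

A binarized page gives the OCR engine clean black text on a white background, which is the purpose of the preprocessing step.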

Marking Regions of Image for Information Extraction

Here in this step we will mark the regions of the image from where we have to extract the data. After marking those…
