Reading PDF in Python

Contents

How to Read PDF Files in Python
Reading and Extracting Text from PDF Files in Python
Using PDFQuery
Installation
Open a New Python File
Code For Reading PDF Files and Converting to XML
Code For Extracting Text from XML File
The IronPDF Library

How to Read PDF Files in Python

PDF (Portable Document Format) is one of the most popular digital file formats for sending and receiving documents online. It is mainly used to preserve a document's formatting, and a PDF can also be protected with password-based encryption.

In this article, we are going to read content from a PDF file in Python. Many online tools can do this, but here we will use a Python library to extract information from PDF files. Several Python libraries are available for working with PDF documents; this article uses PDFQuery and then looks at IronPDF.

With a suitable library, we can extract the text of a PDF page, extract data from PDF tables, or extract data from a particular section of a page.

Reading and Extracting Text from PDF Files in Python

We can extract text from a PDF document by following these steps:

Install a Python PDF library
Load and open an existing PDF file
Read the PDF file with the appropriate library methods
Use the library's extraction methods to pull out the text
Finally, print the extracted text or save it to a text file

Using PDFQuery

PDFQuery is a free, open-source, pure Python library for working with PDF files. It is designed to extract data from multiple PDF files with relatively simple syntax. It is built on PDFMiner, lxml, and PyQuery, so it combines PDF parsing with XML-style element queries. You can read and extract text from any position in the PDF file by providing coordinates, an exact text match, or keywords.

Installation

Python libraries are installed with the pip package manager, which automatically downloads and installs the requested package.

To download PDFQuery, run the command pip install pdfquery in Windows cmd or PowerShell.

Note: When installing Python, add it to the PATH environment variable so that the above command can be run from anywhere in cmd or PowerShell. On systems that also have Python 2 installed, use pip3 to make sure the package is installed for Python 3.
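Once the install has finished, a quick check from Python can confirm that the package is visible to the interpreter you are using. The sketch below uses only the standard library; the helper name is_installed is made up for this illustration.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Report on the two packages that the code in this article imports.
for name in ("pdfquery", "pandas"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a package is reported as missing even though pip succeeded, pip and the interpreter probably belong to different Python installations.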

Open a New Python File

To use the PDFQuery library, we need to write the code in a Python file and execute it to read and extract text from PDF files.

Search for Python's default IDLE shell in the Windows search bar and open it, then press Ctrl + N or select "New File" from the File tab. This opens the text editor for writing the code.
Next, save the file with an appropriate name. I'm naming it "readpdf.py". The file looks like this:

To start reading and extracting text using the PDFQuery library, we need to have a PDF document.

The following single-page document will be used as input PDF for extracting data in Python:

Code For Reading PDF Files and Converting to XML

The following code loads and opens the PDF in Python by creating a PDFQuery object. It then converts the document to an XML file with the tree.write method for further text extraction:

    import pdfquery
    import pandas as pd

    pdf = pdfquery.PDFQuery('example.pdf')
    pdf.load()

    # convert the pdf to XML
    pdf.tree.write('test2.xml', pretty_print=True)

Note: The load method is slow on first use because it relies on the PDFMiner engine to parse the PDF file and compares each element on the page with every other element. To speed up later runs, cache the parsed file. See the PDFQuery documentation for more detail about caching.
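As a rough sketch of the caching mentioned in the note: the PDFQuery README shows a FileCache object passed through the parse_tree_cacher argument. The helper names default_cache_dir and load_with_cache, and the choice of the system temp directory, are assumptions made for this example. The trailing path separator is kept on the assumption that FileCache builds its file names by appending to the directory string.

```python
import os
import tempfile


def default_cache_dir() -> str:
    """Return the system temp directory with a trailing path separator."""
    return os.path.join(tempfile.gettempdir(), "")


def load_with_cache(pdf_path, cache_dir=None):
    """Load a PDF with PDFQuery, storing the parsed tree in cache_dir."""
    # Imported here so the sketch can be read without pdfquery installed.
    import pdfquery
    from pdfquery.cache import FileCache

    cache_dir = cache_dir or default_cache_dir()
    pdf = pdfquery.PDFQuery(pdf_path, parse_tree_cacher=FileCache(cache_dir))
    pdf.load()
    return pdf
```

Calling load_with_cache('example.pdf') a second time should then reuse the cached parse instead of repeating the slow step.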

The output of the above code on Python Shell is as follows:

The resulting XML file contains both the content of the PDF and the layout metadata for each of its pages.

Code For Extracting Text from XML File

The XML file encodes the PDF in a format that both humans and machines can read easily. We first need to look through the XML file to find the data we want to extract with Python.
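Instead of reading the XML by eye, a short script can list every text line together with its coordinates. The sketch below uses the standard library's ElementTree on a small, hypothetical fragment that only imitates the shape of PDFQuery's output (real files carry more attributes and nesting); in practice you would read the contents of test2.xml instead of SAMPLE_XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the XML that PDFQuery writes.
SAMPLE_XML = """<pdfxml>
  <LTPage page_index="0">
    <LTTextLineHorizontal bbox="[1216.361, 1421.998, 1373.641, 1448.411]">
      <LTTextBoxHorizontal>NOTIFICATION</LTTextBoxHorizontal>
    </LTTextLineHorizontal>
    <LTTextLineHorizontal bbox="[72.0, 700.0, 300.0, 712.0]">Date of issue</LTTextLineHorizontal>
  </LTPage>
</pdfxml>"""


def list_text_lines(xml_text):
    """Return (text, bbox) pairs for every LTTextLineHorizontal element."""
    root = ET.fromstring(xml_text)
    rows = []
    for element in root.iter("LTTextLineHorizontal"):
        # itertext() also collects text held in nested child elements.
        text = "".join(element.itertext()).strip()
        rows.append((text, element.get("bbox")))
    return rows


for text, bbox in list_text_lines(SAMPLE_XML):
    print(f"{text!r} at {bbox}")
```

The printed bbox values are the numbers you would later hand to a coordinate-based query.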

The following code snippet is from the XML file for the text Notification:

We can see the text "Notification" inside the LTTextLineHorizontal tag, along with its metadata.

Now we can use PDFQuery's in_bbox selector, passing the bounding box coordinates to get the text from this tag. The four coordinates are, in order, the left, bottom, right, and top edges of the region we are extracting.
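To make the coordinate order concrete, here is a minimal pure-Python sketch of the containment test that a bounding-box query implies. It assumes, as the PDFQuery README describes, that in_bbox matches only elements that fit entirely inside the given box. The function fits_in_bbox is written for this illustration and is not part of PDFQuery.

```python
def fits_in_bbox(element, query):
    """True if element (left, bottom, right, top) lies entirely inside query."""
    ex0, ey0, ex1, ey1 = element
    qx0, qy0, qx1, qy1 = query
    return ex0 >= qx0 and ey0 >= qy0 and ex1 <= qx1 and ey1 <= qy1


# Coordinates of the "NOTIFICATION" line used in this article.
notification = (1216.361, 1421.998, 1373.641, 1448.411)

print(fits_in_bbox(notification, notification))               # exact box
print(fits_in_bbox(notification, (1200, 1400, 1400, 1460)))   # slightly larger box
print(fits_in_bbox(notification, (0, 0, 600, 800)))           # unrelated region
```

Under that assumption, a box that is a little larger than the element still matches, which helps when the same field shifts slightly between documents.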

The code is simple and as follows:

    text = pdf.pq('LTTextLineHorizontal:in_bbox("1216.361, 1421.998, 1373.641, 1448.411")').text()
    print(text)

The text method returns the actual "NOTIFICATION" text from the bounding box. The output is as follows:

There is another way to find text in the PDF file: search every page for elements that contain a given string and retrieve the matching text.

    text = pdf.pq('LTTextLineHorizontal:contains("NOTIFICATION")').text()

The output is exactly the same. In this way, the required information can be fetched from the PDF-encoded XML file and displayed using a Pandas DataFrame.
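As a small sketch of that last step, the extracted strings can be collected into rows and handed to pandas. The field name and the hard-coded value below are placeholders standing in for the text returned by the pdf.pq(...) queries.

```python
import pandas as pd

# Placeholder rows; in a real script "text" would come from pdf.pq(...).text().
extracted = [
    {"field": "heading", "text": "NOTIFICATION"},
]

# One row per extracted field, with a fixed column order.
df = pd.DataFrame(extracted, columns=["field", "text"])
print(df)
```

Adding one dictionary per query keeps each extracted value labelled, and the resulting DataFrame can be saved with df.to_csv if a file is needed.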

The IronPDF Library

IronPDF is a useful tool for creating and extracting text from PDF documents in .NET projects. A common use of this library is “HTML to PDF” rendering, where HTML is used as the design language for rendering a new PDF file.

IronPDF uses a Chromium engine to render HTML pages to PDF files. With HTML to PDF file conversion, there is no need to use complex APIs to position or design PDF documents. IronPDF also supports all standard web page technologies: HTML, ASPX, JS, CSS, and images.

It also lets you create PDF documents from HTML5, CSS, JavaScript, and images. You can edit and stamp PDFs, add headers and footers, and rotate pages. It also makes it easy to read text from a PDF file and extract its images.

In many cases, you can extract embedded text from PDFs directly.

The following code helps you read and extract text and images from PDF files:

    from ironpdf import *

    # Load existing PDF document
    pdf = PdfDocument.FromFile("content.pdf")

    # Extract text and images from PDF document
    all_text = pdf.ExtractAllText()
    all_images = pdf.ExtractAllImages()

    # Or, extract text and images from specific pages
    page_2_text = pdf.ExtractTextFromPage(1)
    for index in range(pdf.PageCount):
        page_number = index + 1
        images = pdf.ExtractBitmapsFromPage(index)

The code above is short and clean: a few lines are enough to load the document and extract its text and images. It also involves fewer steps than the PDFQuery approach shown earlier, which goes through an intermediate XML file.

For more detail on reading PDF files in C#, see the IronPDF documentation.

Download IronPDF and try it for free with a 30-day trial.
