Python get pdf text

Contents

Read PDF in Python and convert to text in PDF
pypdf: Pure Python
PDFium: High quality and very fast, but with C-dependency
How to Extract Data from PDF Files with Python
How to Use PDFQuery
Package installation
Import the libraries
Read and convert the PDF files
Access and extract the Data
Conclusion
Extract text from PDF
4 Answers

Read PDF in Python and convert to text in PDF

There are various Python packages to extract the text from a PDF with Python. You can see a speed/quality benchmark.

As the maintainer of pypdf and PyPDF2 I am biased, but I would recommend pypdf as a starting point. It is pure Python and BSD 3-clause licensed, which should work for most people. pypdf can also do much more with PDF files (e.g. transformations).

If you are comfortable with the C dependency and don’t need to modify the PDF, give pypdfium2 a shot. pypdfium2 is really fast and has excellent extraction quality.

I previously recommended Poppler’s pdftotext. Don’t use that. Its quality is worse than PDFium/PyPDF2.

Tika and PyMuPDF work about as well as PDFium, but they also have non-Python dependencies. PyMuPDF might not work for you because of its licensing (AGPL or a commercial license).

I would NOT use pdfminer / pdfminer.six / pdfplumber / pdftotext / borb / PyPDF2 / PyPDF3 / PyPDF4.

pypdf: Pure Python

Installation: pip install pypdf

from pypdf import PdfReader

reader = PdfReader("example.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
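The page loop above could be wrapped into a small batch converter for a folder of PDF files. This is a minimal sketch, not part of pypdf: the names pypdf_extract and convert_folder are hypothetical, and the extractor is passed in as a callable so that a different library can be swapped in.

```python
from pathlib import Path
from typing import Callable, List


def pypdf_extract(path: Path) -> str:
    """Extract the text of every page with pypdf, one page per block."""
    # Imported lazily so convert_folder can be used with another extractor
    # even when pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    return "\n".join(page.extract_text() for page in reader.pages)


def convert_folder(
    folder: Path, extract: Callable[[Path], str] = pypdf_extract
) -> List[Path]:
    """Write a .txt file next to every .pdf in folder; return the new paths."""
    written = []
    for pdf_path in sorted(folder.glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        txt_path.write_text(extract(pdf_path), encoding="utf-8")
        written.append(txt_path)
    return written
```

Calling convert_folder(Path("invoices")) would then convert every PDF in a (hypothetical) invoices directory using pypdf.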

PDFium: High quality and very fast, but with C-dependency

Installation: pip install pypdfium2

import pypdfium2 as pdfium

# `data` is the input PDF, e.g. a file path or the file's bytes
text = ""
pdf = pdfium.PdfDocument(data)
for i in range(len(pdf)):
    page = pdf.get_page(i)
    textpage = page.get_textpage()
    text += textpage.get_text()
    text += "\n"
    # Release the native PDFium objects explicitly
    for g in (textpage, page):
        g.close()
pdf.close()


How to Extract Data from PDF Files with Python

Shittu Olumide

Data is present in all areas of the modern digital world, and it takes many different forms.

One of the most common formats for data is PDF. Invoices, reports, and other forms are frequently stored in Portable Document Format (PDF) files by businesses and institutions.

Extracting data from PDF files can be laborious and time-consuming. Fortunately, Python provides a variety of libraries that make it easier.

This tutorial will explain how to extract data from PDF files using Python. You’ll learn how to install the necessary libraries and I’ll provide examples of how to do so.

There are several Python libraries you can use to read and extract data from PDF files. These include PDFMiner, PyPDF2, PDFQuery and PyMuPDF. Here, we will use PDFQuery to read and extract data from multiple PDF files.

How to Use PDFQuery

PDFQuery is a Python library that provides an easy way to extract data from PDF files by using CSS-like selectors to locate elements in the document.

It reads a PDF file as an object, converts the PDF object to an XML file, and accesses the desired information by its specific location inside of the PDF document.

Let’s consider a short example to see how it works.

from pdfquery import PDFQuery

pdf = PDFQuery('example.pdf')
pdf.load()

# Use CSS-like selectors to locate the elements
text_elements = pdf.pq('LTTextLineHorizontal')

# Extract the text from the elements
text = [t.text for t in text_elements]

print(text)

In this code, we first create a PDFQuery object by passing the filename of the PDF file we want to extract data from. We then load the document into the object by calling the load() method.

Next, we use CSS-like selectors to locate the text elements in the PDF document. The pq() method is used to locate the elements, which returns a PyQuery object that represents the selected elements.

Finally, we extract the text from the elements by accessing the text attribute of each element, and we store the extracted text in a list called text.

Let’s consider another method we can use to read PDF files, extract some data elements, and create a structured dataset using PDFQuery. We will follow these steps:

Package installation.
Import the libraries.
Read and convert the PDF files.
Access and extract the Data.

Package installation

First, we need to install PDFQuery and also install Pandas for some analysis and data presentation.

pip install pdfquery
pip install pandas

Import the libraries

import pandas as pd
import pdfquery

We import the two libraries to be able to use them in our project.

Read and convert the PDF files

# read the PDF
pdf = pdfquery.PDFQuery('customers.pdf')
pdf.load()

# convert the PDF to XML
pdf.tree.write('customers.xml', pretty_print=True)
pdf

We read the PDF file into our project as an element object and load it, then convert the PDF object into an Extensible Markup Language (XML) file. This file contains the data and the metadata of a given PDF page.

XML encodes the contents of the PDF in a format that both humans and machines can read. Looking at the XML file in a text editor, we can see where the data we want to extract is.

Access and extract the Data

We can get the information we are trying to extract inside the LTTextBoxHorizontal tag, and we can see the metadata associated with it.

The values inside the text box, [68.0, 231.57, 101.990, 234.893] in the XML fragment, refer to the Left, Bottom, Right, Top coordinates of the text box. You can think of this as the boundaries around the data we want to extract.

Let’s access and extract the customer name using the coordinates of the text box.

# access the data using coordinates
customer_name = pdf.pq('LTTextLineHorizontal:in_bbox("68.0, 231.57, 101.990, 234.893")').text()
print(customer_name)

# output: Brandon James

Note: Sometimes the data we want to extract is not in exactly the same location in every file, which can cause issues. Fortunately, PDFQuery can also query tags that contain a given string.
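The same lookup can be repeated for several fields to build one record per PDF, which is what a structured dataset needs. The sketch below is an assumption about how that might be organised: extract_record and FIELD_BOXES are hypothetical names, and only the customer_name box comes from the example above.

```python
def extract_record(pdf, field_boxes, tag="LTTextLineHorizontal"):
    """Return {field name: text} for every (left, bottom, right, top) box.

    pdf is expected to be a loaded PDFQuery object (anything with a
    pq(selector) method whose result has .text() works).
    """
    record = {}
    for name, (left, bottom, right, top) in field_boxes.items():
        selector = '%s:in_bbox("%s, %s, %s, %s")' % (tag, left, bottom, right, top)
        record[name] = pdf.pq(selector).text()
    return record


FIELD_BOXES = {
    # Box taken from the XML fragment discussed above.
    "customer_name": (68.0, 231.57, 101.990, 234.893),
    # Further fields would be added here with their own coordinates.
}
```

With the loaded pdf object from the previous step, pd.DataFrame([extract_record(pdf, FIELD_BOXES)]) would give a one-row table, and appending one record per file gives the full dataset.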
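A string-based query of that kind can be combined with coordinates: find a label by its text, then read whatever sits in a box relative to it. The sketch below follows the pattern shown in the PDFQuery README (a :contains selector followed by in_bbox). The function name text_below_label and the default width and height are assumptions that would need tuning for a real document.

```python
def text_below_label(pdf, label_text, width=150.0, height=30.0,
                     tag="LTTextLineHorizontal"):
    """Return the text found directly below the line containing label_text.

    pdf is a loaded PDFQuery object. PDF coordinates start at the bottom
    left, so "below" means smaller y values.
    """
    label = pdf.pq('%s:contains("%s")' % (tag, label_text))
    if len(label) == 0:
        raise LookupError("no element contains %r" % label_text)

    left = float(label.attr("x0"))
    bottom = float(label.attr("y0"))
    # Box that starts at the label's bottom edge and extends downwards.
    box = (left, bottom - height, left + width, bottom)
    return pdf.pq('%s:in_bbox("%s, %s, %s, %s")' % ((tag,) + box)).text()
```

Because the box is computed from the label's own position, the lookup keeps working when the whole block shifts between files.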

Conclusion

Data extraction from PDF files is a crucial task because these files are frequently used for document storage and sharing.

Python’s PDFQuery is a potent tool for extracting data from PDF files. Anyone looking to extract data from PDF files will find PDFQuery to be a great option thanks to its simple syntax and comprehensive documentation. It is also open-source and can be modified to suit specific use cases.


Extract text from PDF

I have a bunch of PDF files that I need to convert to TXT. Unfortunately, when I use one of the many available utilities to do this, it loses all formatting and all the tabulated data in the PDF gets jumbled up. Is it possible to use Python to extract the text from the PDF by specifying positions, etc.? Thanks.

4 Answers

A PDF does not contain tabular data unless it contains structured content. Some tools include heuristics to try to guess the data structure and put it back. I wrote a blog article explaining the issues with PDF text extraction at http://www.jpedal.org/PDFblog/2009/04/pdf-text/

Is there a way to check whether a PDF is tagged with Adobe’s Structured Content, as you wrote in your blog post? Thank you.

$ pdftotext -layout thingwithtablesinit.pdf

will produce a text file thingwithtablesinit.txt with the tables right.
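Since the question asks about Python, the same command could be driven from a script with the standard subprocess module. This is only a sketch: it assumes the pdftotext binary is installed and on the PATH, and the helper names pdftotext_command and run_pdftotext are hypothetical. The -f and -l options select the first and last page to convert.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path=None, first_page=None, last_page=None):
    """Build the argument list for `pdftotext -layout`."""
    cmd = ["pdftotext", "-layout"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd.append(str(pdf_path))
    if txt_path is not None:
        # Without an explicit output name, pdftotext writes next to the PDF.
        cmd.append(str(txt_path))
    return cmd


def run_pdftotext(pdf_path, txt_path=None, **page_range):
    """Run pdftotext and raise CalledProcessError if it fails."""
    subprocess.run(pdftotext_command(pdf_path, txt_path, **page_range), check=True)
```

For example, run_pdftotext("thingwithtablesinit.pdf") is equivalent to the shell command shown above.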

I had a similar problem and ended up using XPDF from http://www.foolabs.com/xpdf/ One of the utilities is PDFtoText, but I guess it all comes down to how the PDF was produced.

I tried several methods as well. I used PyPDF and PDF Miner, and even tried using Acrobat to save to text. None of them worked as well as xpdf’s pdftotext with the -layout option. I wouldn’t bother with anything else.

As explained in other answers, extracting text from PDF is not a straightforward task. However, there are certain Python libraries such as pdfminer (pdfminer3k for Python 3) that are reasonably efficient.
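Once the layout is preserved, tabulated rows can often be split back into columns with a simple heuristic. The sketch below assumes that columns are separated by at least two spaces while words inside a cell are separated by one. That assumption will not hold for every document, and empty cells will shift the remaining values to the left. The function names and the sample rows are made up for illustration.

```python
import re

# Two or more whitespace characters are treated as a column boundary.
COLUMN_GAP = re.compile(r"\s{2,}")


def split_columns(line):
    """Split one layout-preserved line into its cells."""
    return COLUMN_GAP.split(line.strip())


def parse_table(layout_text):
    """Return a list of rows (lists of cells), skipping blank lines."""
    return [split_columns(line)
            for line in layout_text.splitlines()
            if line.strip()]


sample = (
    "Name           Qty   Price\n"
    "Blue widget      2    3.50\n"
    "\n"
    "Red widget      10   12.00\n"
)
rows = parse_table(sample)
```

Here rows holds three lists of three cells each, with "Blue widget" kept together as a single cell.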

The code snippet below shows a Python class which can be instantiated to extract text from a PDF. This will work in most cases.

# Python 2.7.6
# PdfAdapter.py

"""
Reusable library to extract text from pdf file
Uses pdfminer library; For Python 3.x use pdfminer3k module
Below links have useful information on components of the program
https://euske.github.io/pdfminer/programming.html
http://denis.papathanasiou.org/posts/2010.08.04.post.html
"""

import logging

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
# From PDFInterpreter import both PDFResourceManager and PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
# from pdfminer.pdfdevice import PDFDevice
# To raise exception whenever text extraction from PDF is not allowed
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.converter import PDFPageAggregator

# Basic logging config
log = logging.getLogger(__name__)
log.addHandler(logging.NullHandler())


class pdf_text_extractor:
    """
    Modules overview:
     - PDFParser: fetches data from pdf file
     - PDFDocument: stores data parsed by PDFParser
     - PDFPageInterpreter: processes page contents from PDFDocument
     - PDFDevice: translates processed information from PDFPageInterpreter
       to whatever you need
     - PDFResourceManager: Stores shared resources such as fonts or images
       used by both PDFPageInterpreter and PDFDevice
     - LAParams: A layout analyzer returns a LTPage object for each page
       in the PDF document
     - PDFPageAggregator: Extract the device to page aggregator to get LT
       object elements
    """

    def __init__(self, pdf_file_path, password=""):
        """
        Class initialization block.
        pdf_file_path - Full path of pdf including name
        password - If not passed, assumed as none
        """
        self.pdf_file_path = pdf_file_path
        self.password = password

    def getText(self):
        """
        Algorithm:
        1) Transfer information from PDF file to PDF document object using parser
        2) Open the PDF file
        3) Parse the file using PDFParser object
        4) Assign the parsed content to PDFDocument object
        5) Now the information in this PDFDocument object has to be processed.
           For this we need PDFPageInterpreter, PDFDevice and PDFResourceManager
        6) Finally process the file page by page
        """
        # Open and read the pdf file in binary mode
        with open(self.pdf_file_path, "rb") as fp:
            # Create parser object to parse the pdf content
            parser = PDFParser(fp)

            # Store the parsed content in PDFDocument object
            document = PDFDocument(parser, self.password)

            # Check if document is extractable, if not abort
            if not document.is_extractable:
                raise PDFTextExtractionNotAllowed

            # Create PDFResourceManager object that stores shared resources
            # such as fonts or images
            rsrcmgr = PDFResourceManager()

            # Set parameters for analysis
            laparams = LAParams()

            # Create a PDFDevice object which translates interpreted
            # information into desired format
            # Device to connect to resource manager to store shared resources
            # device = PDFDevice(rsrcmgr)
            # Extract the device to page aggregator to get LT object elements
            device = PDFPageAggregator(rsrcmgr, laparams=laparams)

            # Create interpreter object to process content from PDFDocument
            # Interpreter needs to be connected to resource manager for shared
            # resources and device
            interpreter = PDFPageInterpreter(rsrcmgr, device)

            # Initialize the text
            extracted_text = ""

            # Now that we have everything needed to process a pdf document,
            # process it page by page
            for page in PDFPage.create_pages(document):
                # The interpreter processes the page stored in the
                # PDFDocument object
                interpreter.process_page(page)
                # The device renders the layout from the interpreter
                layout = device.get_result()
                # Out of the many LT objects within layout, we are interested
                # in LTTextBox and LTTextLine
                for lt_obj in layout:
                    if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                        extracted_text += lt_obj.get_text()

            return extracted_text.encode("utf-8")
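The layout objects produced by the page aggregator also carry their position, which is what the question about extracting text by position needs. The following is a sketch, not part of pdfminer: text_in_region is a hypothetical helper that relies only on each layout object exposing a bbox tuple (x0, y0, x1, y1) and a get_text() method, as pdfminer's text boxes and text lines do.

```python
def text_in_region(layout, region):
    """Join the text of all objects that lie completely inside region.

    layout: iterable of layout objects (e.g. the result of device.get_result()).
    region: (left, bottom, right, top) in PDF points, origin at the bottom left.
    """
    left, bottom, right, top = region
    hits = []
    for obj in layout:
        if not hasattr(obj, "get_text"):
            continue  # skip lines, rectangles, images and other non-text objects
        x0, y0, x1, y1 = obj.bbox
        if x0 >= left and y0 >= bottom and x1 <= right and y1 <= top:
            hits.append(obj)
    # Reading order: top of the page first, then left to right.
    hits.sort(key=lambda o: (-o.bbox[3], o.bbox[0]))
    return "".join(o.get_text() for o in hits)
```

Inside the page loop of the class above, text_in_region(layout, (0, 0, 300, 800)) would, for instance, keep only the text in the left part of a page. The region values here are arbitrary.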

Note: There are other libraries, such as PyPDF2, which are good at transforming a PDF, for example merging PDF pages or splitting or cropping specific pages out of a PDF.
