Reading PDF files with Python and pandas

Python Data Import No. 5: Importing Tables from PDF

We need to import tables of a particular type from many PDF files and stack them vertically into a single table.

An extra complication is that the PDF files contain tables of different kinds, and we need to pick out only certain ones.

Solution

To solve this we need three modules: pandas, glob, and tabula. tabula extracts tables from PDF files, glob builds a list of the PDF files in a folder while filtering out everything else, and pandas cleans up the extracted tables.

Functions used

  • glob.glob
  • tabula.read_pdf
  • display
  • pandas.DataFrame
  • pandas.concat
  • pandas.to_datetime
  • pandas.DataFrame.dropna
  • pandas.Series.str.replace
  • pandas.DataFrame.to_csv
  • pandas.Series
  • pandas.Series.repeat
  • len
  • pandas.DataFrame.append
  • pandas.DataFrame.reset_index
  • pandas.Series.rename

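Code

List all PDF files in the working folder

 import glob
 import pandas as pd
 import tabula

 pdf_files = glob.glob('*.pdf')
 pdf_files
 # Read every table from one PDF (here the second file in the list)
 pdf_tables = tabula.read_pdf(pdf_files[1], pages='all', multiple_tables=True, lattice=True)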

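Combine the tables from one PDF

 # 'Дата' ("Date") is the header of the first column in the tables we want
 df_single = pd.DataFrame()
 for table in pdf_tables:
     if table.columns[0] == 'Дата':
         df_single = pd.concat([df_single, table])
     elif table.iloc[0,0] == 'Дата':
         # The header was read as the first data row: promote it to column names
         table.columns = table.iloc[0]
         df_single = pd.concat([df_single, table])
     else:
         continue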
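The same selection logic can be wrapped in a function so that it runs over every file in pdf_files, which is what the task ultimately requires. The following is only a sketch: select_date_tables, combine_pdfs, and the read_tables parameter are hypothetical names, not part of the lesson's code. Unlike the loop above, the sketch drops the promoted header row immediately instead of leaving it for a later cleanup step.

```python
import pandas as pd


def select_date_tables(tables, key="Дата"):
    """Keep only tables whose header (or first row) starts with `key`."""
    kept = []
    for table in tables:
        if table.empty:
            continue
        if table.columns[0] == key:
            kept.append(table)
        elif table.iloc[0, 0] == key:
            # Header was read as data: use it as column names and drop that row
            promoted = table.iloc[1:].copy()
            promoted.columns = table.iloc[0].tolist()
            kept.append(promoted)
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)


def combine_pdfs(pdf_files, read_tables):
    """Stack the selected tables of every file vertically.

    `read_tables` is a hypothetical callable: path -> list of DataFrames.
    """
    frames = [select_date_tables(read_tables(path)) for path in pdf_files]
    frames = [frame for frame in frames if not frame.empty]
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

With tabula-py installed, read_tables could be a small wrapper such as lambda path: tabula.read_pdf(path, pages='all', multiple_tables=True, lattice=True), mirroring the call used earlier.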

Convert the «Дата» (Date) column to the datetime type.

 df_single['Дата'] = pd.to_datetime(df_single['Дата'], format='%d.%m.%Y', errors='coerce')

Drop the rows where the «Дата» column is null.
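A minimal sketch of this step, followed by writing the result to CSV, using pandas.DataFrame.dropna, reset_index, and to_csv from the function list above. The sample frame is made up for illustration, and returning the CSV as a string instead of writing a file is a choice made only for this sketch.

```python
import pandas as pd

# Made-up sample: two valid dates, one repeated header row, one empty cell
df_single = pd.DataFrame({
    "Дата": ["01.02.2020", "Дата", None, "15.03.2020"],
    "Сумма": ["10", "Сумма", "x", "20"],
})

# Values that do not match the format become NaT because of errors='coerce'
df_single["Дата"] = pd.to_datetime(
    df_single["Дата"], format="%d.%m.%Y", errors="coerce"
)

# Keep only rows with a real date, then renumber the index
df_clean = df_single.dropna(subset=["Дата"]).reset_index(drop=True)

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = df_clean.to_csv(index=False)
```

Because repeated header rows and other stray cells fail the date format, they turn into NaT and are removed by the same dropna call.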

Course: Importing Data in Python

Lessons in the course (number, title, description):
1. Python Data Import No. 1: Importing Excel. Learn to import data from MS Excel workbooks in xlsx format.
2. Python Data Import No. 2: Importing CSV. Learn to import data from CSV text files.
3. Python Data Import No. 3: Importing from a Website (HTML). Import a table from a web page and write the result to a CSV file.
4. Python Data Import No. 4: Importing XML Tables. Learn to import XML tables, using data from the Bank of Russia website as an example.
5. Python Data Import No. 5: Importing Tables from PDF. Learn to import the required tables from PDF files, stack them vertically into one large table, and write the result to a CSV file.
6. Python Data Import No. 6: Importing Tables from Word. Learn to import tables from MS Word documents in docx format.
7. Python Data Import No. 7: Importing Tables from Word. In this lesson we extract a table from a Word document and write it to a CSV file. For this we need the python-docx and pandas modules.

How to Extract Data from PDF Files with Python

Shittu Olumide

Data is present in all areas of the modern digital world, and it takes many different forms.

One of the most common formats for data is PDF. Invoices, reports, and other forms are frequently stored in Portable Document Format (PDF) files by businesses and institutions.

Extracting data from PDF files can be laborious and time-consuming. Fortunately, Python provides a variety of libraries that make it easier.

This tutorial will explain how to extract data from PDF files using Python. You’ll learn how to install the necessary libraries, and I’ll walk through examples of using them.

There are several Python libraries you can use to read and extract data from PDF files. These include PDFMiner, PyPDF2, PDFQuery and PyMuPDF. Here, we will use PDFQuery to read and extract data from multiple PDF files.

How to Use PDFQuery

PDFQuery is a Python library that provides an easy way to extract data from PDF files by using CSS-like selectors to locate elements in the document.

It reads a PDF file as an object, converts the PDF object to an XML file, and accesses the desired information by its specific location inside of the PDF document.

Let’s consider a short example to see how it works.

from pdfquery import PDFQuery

pdf = PDFQuery('example.pdf')
pdf.load()

# Use CSS-like selectors to locate the elements
text_elements = pdf.pq('LTTextLineHorizontal')

# Extract the text from the elements
text = [t.text for t in text_elements]

print(text)

In this code, we first create a PDFQuery object by passing the filename of the PDF file we want to extract data from. We then load the document into the object by calling the load() method.

Next, we use CSS-like selectors to locate the text elements in the PDF document. The pq() method locates the elements and returns a PyQuery object that represents them.

Finally, we extract the text from the elements by accessing the text attribute of each element, and we store the extracted text in a list called text.

Let’s consider another way to read PDF files, extract some data elements, and create a structured dataset using PDFQuery. We will go through these steps:

  • Package installation.
  • Import the libraries.
  • Read and convert the PDF files.
  • Access and extract the Data.

Package installation

First, we need to install PDFQuery and also install Pandas for some analysis and data presentation.

pip install pdfquery
pip install pandas

Import the libraries

import pandas as pd
import pdfquery

We import the two libraries to be able to use them in our project.

Read and convert the PDF files

#read the PDF
pdf = pdfquery.PDFQuery('customers.pdf')
pdf.load()

#convert the pdf to XML
pdf.tree.write('customers.xml', pretty_print = True)
pdf

We read the PDF file into our project as an element object and load it. We then write the PDF object out as an Extensible Markup Language (XML) file. This file contains the data and the metadata of a given PDF page.

XML encodes the PDF content in a format that is readable by both humans and machines. Opening the XML file in a text editor, we can see where the data we want to extract is.

Access and extract the Data

We can get the information we are trying to extract inside the LTTextBoxHorizontal tag, and we can see the metadata associated with it.
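To make the shape of that XML more concrete, here is a small sketch that lists every text line together with its bounding box. The XML string is a hand-written stand-in modelled on the tag names and coordinates quoted in this tutorial, not real output from customers.xml, and list_text_lines is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a fragment of the exported XML (an assumption)
SAMPLE_XML = """<pdfxml>
  <LTPage pageid="1">
    <LTTextBoxHorizontal bbox="[68.0, 231.57, 101.99, 234.893]">
      <LTTextLineHorizontal bbox="[68.0, 231.57, 101.99, 234.893]">Brandon James </LTTextLineHorizontal>
    </LTTextBoxHorizontal>
  </LTPage>
</pdfxml>"""


def list_text_lines(xml_text):
    """Return (text, [left, bottom, right, top]) for every text line."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.iter("LTTextLineHorizontal"):
        text = (node.text or "").strip()
        bbox = [float(v) for v in node.get("bbox").strip("[]").split(",")]
        rows.append((text, bbox))
    return rows


for text, bbox in list_text_lines(SAMPLE_XML):
    print(text, bbox)
```

Scanning the file this way can be quicker than reading it in a text editor when a page contains many text boxes.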

The values inside the text box, [68.0, 231.57, 101.990, 234.893] in the XML fragment, refer to the Left, Bottom, Right, Top coordinates of the text box. You can think of these as the boundaries around the data we want to extract.

Let’s access and extract the customer name using the coordinates of the text box.

# access the data using coordinates
customer_name = pdf.pq('LTTextLineHorizontal:in_bbox("68.0, 231.57, 101.990, 234.893")').text()

print(customer_name)

#output: Brandon James
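When several fields have to be extracted, typing the selector string by hand gets error-prone. The sketch below builds it from the four numbers; bbox_selector and its pad parameter are hypothetical helpers introduced here for illustration, not part of PDFQuery.

```python
def bbox_selector(left, bottom, right, top, pad=0.0, tag="LTTextLineHorizontal"):
    """Build an in_bbox selector string, optionally widening the box by `pad`."""
    box = (
        round(left - pad, 3),
        round(bottom - pad, 3),
        round(right + pad, 3),
        round(top + pad, 3),
    )
    return '%s:in_bbox("%s, %s, %s, %s")' % ((tag,) + box)


selector = bbox_selector(68.0, 231.57, 101.990, 234.893)
print(selector)
```

The resulting string can be passed to pdf.pq(...) in place of the literal selector. A small pad gives some tolerance when the text sits slightly outside the measured box.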

Note: Sometimes the data we want to extract is not in the exact same location in every file which can cause issues. Fortunately, PDFQuery can also query tags that contain a given string.
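One way this might look is to find a label by its text with the :contains selector, read its coordinates, and then query a box positioned relative to it. This is a sketch under assumptions: value_below_label is a hypothetical helper, pdf is assumed to be an already loaded PDFQuery object, and the box size (width, height) is a guess that has to be tuned for each document layout.

```python
def value_below_label(pdf, label_text, width=150.0, height=30.0):
    """Return the text found in a box just below the line containing label_text.

    `pdf` is assumed to be a loaded pdfquery.PDFQuery object; `width` and
    `height` are hypothetical box sizes in PDF points.
    """
    # Locate the line that contains the label text
    label = pdf.pq('LTTextLineHorizontal:contains("%s")' % label_text)
    left = float(label.attr("x0"))
    bottom = float(label.attr("y0"))

    # Box spanning `height` below the label and `width` to its right
    box = (left, bottom - height, left + width, bottom)
    selector = 'LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % box
    return pdf.pq(selector).text()
```

A call such as value_below_label(pdf, "Customer name") would then keep working when the whole block shifts on the page, as long as the value keeps its position relative to the label. The label text "Customer name" is only an example and does not come from the sample file.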

Conclusion

Data extraction from PDF files is a crucial task because these files are frequently used for document storage and sharing.

Python’s PDFQuery is a potent tool for extracting data from PDF files. Anyone looking to extract data from PDF files will find PDFQuery to be a great option thanks to its simple syntax and comprehensive documentation. It is also open-source and can be modified to suit specific use cases.

How to Extract Table from PDF with Python and Pandas

In this short tutorial, we’ll see how to extract tables from PDF files with Python and Pandas.

We will cover two cases of table extraction from PDF:

(1) Simple table with tabula-py

from tabula import read_pdf

df_temp = read_pdf('china.pdf')

(2) Table with merged cells

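from io import StringIO
import pandas as pd

# Recent pandas versions expect a file-like object rather than a raw HTML string
html_tables = pd.read_html(StringIO(page))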

Let’s cover both examples in more detail as context is important.

1: Extract tables from PDF with Python

In this example we will extract multiple tables from a remote PDF file: china.pdf.

We will use the tabula-py library, which can be installed with:

pip install tabula-py

The PDF file contains two tables:

from tabula import read_pdf

file = 'https://raw.githubusercontent.com/tabulapdf/tabula-java/master/src/test/resources/technology/tabula/china.pdf'
df_temp = read_pdf(file, stream=True)

After reading the data we can get a list of DataFrames which contain table data.

   FLA Audit Profile      Unnamed: 0
0  Country                China
1  Factory name           01001523B
2  IEM                    BVCPS (HK), Shen Zhen Office
3  Date of audit          May 20-22, 2003
4  PC(s)                  adidas-Salomon
5  Number of workers      243
6  Product(s)             Scarf, cap, gloves, beanies and headbands
7  Production processes   Sewing, cutting, packing, embroidery, die-cutting

This exactly matches the first table in the PDF file.

The second table looks odd, because its merged cells are extracted as NaN values:

   Unnamed: 0                  Unnamed: 1                     Unnamed: 2     Findings            Unnamed: 3
0  FLA Code/ Compliance issue  Legal Reference / Country Law  FLA Benchmark  Monitor’s Findings  NaN
1  1. Code Awareness           NaN                            NaN            NaN                 NaN
2  2. Forced Labor             NaN                            NaN            NaN                 NaN
3  3. Child Labor              NaN                            NaN            NaN                 NaN
4  4. Harassment or Abuse      NaN                            NaN            NaN                 NaN
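Before switching tools, a frame like this can at least be tidied with plain pandas. This is a cleanup sketch and not the approach the tutorial goes on to use: it drops columns that are entirely NaN and promotes the first row to the header. The raw frame is retyped from the first rows of the output above, and tidy_merged_header is a hypothetical helper name.

```python
import pandas as pd

nan = float("nan")

# First three rows of the extracted table, retyped from the output above
raw = pd.DataFrame({
    "Unnamed: 0": ["FLA Code/ Compliance issue", "1. Code Awareness", "2. Forced Labor"],
    "Unnamed: 1": ["Legal Reference / Country Law", nan, nan],
    "Unnamed: 2": ["FLA Benchmark", nan, nan],
    "Findings": ["Monitor’s Findings", nan, nan],
    "Unnamed: 3": [nan, nan, nan],
})


def tidy_merged_header(df):
    """Drop all-NaN columns and use the first row as the column names."""
    out = df.dropna(axis=1, how="all")
    header = out.iloc[0].tolist()
    out = out.iloc[1:].reset_index(drop=True)
    out.columns = header
    return out


tidy = tidy_merged_header(raw)
print(tidy)
```

This fixes the header but cannot recover text that tabula failed to capture from the merged cells, so those cells remain NaN.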

Some cells are also split across multiple rows. We will see how to work around this problem in the next step.

2: Extract tables from PDF — keep format

Often tables in PDF files have:

Most libraries and software are not able to extract them in a reliable way.

To extract complex tables from PDF files with Python and Pandas we will:

  • download the file (it’s possible without download)
  • convert the PDF file to HTML
  • extract the tables with Pandas

2.1 Convert PDF to HTML

First we will download the file from: china.pdf.

Then we will convert it to HTML with the library: pdftotree.

import pdftotree

page = pdftotree.parse('china.pdf', html_path=None, model_type=None, model_path=None, visualize=False)

The library can be installed with pip install pdftotree.

2.2 Extract tables with Pandas

Finally we can read all the tables from this page with Pandas:

from io import StringIO
import pandas as pd

# Recent pandas versions expect a file-like object rather than a raw HTML string
html_tables = pd.read_html(StringIO(page))
html_tables[1]

This gives better results than tabula-py for the table with merged cells.

2.3 HTMLTableParser

As an alternative to Pandas, we can use the library html-table-parser-python3 to parse the HTML tables into Python lists.

from html_table_parser.parser import HTMLTableParser

p = HTMLTableParser()
p.feed(page)
print(p.tables[0])

It converts the HTML table to a Python list:

[['', ''], ['Country', 'China'], ['Factory name', '01001523B'], ['IEM', 'BVCPS (HK), Shen Zhen Office'], ['Date of audit', 'May 20-22, 2003'], ['PC(s)', 'adidas-Salomon'], ['Number of workers', '243'], ['Product(s)', 'Scarf, cap, gloves, beanies and headbands']] 

Now we can convert the list to Pandas DataFrame:

import pandas as pd

pd.DataFrame(p.tables[1])
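For the two-column list printed above, the empty first row can be skipped and the remaining pairs given explicit column names. This is a sketch: the rows are copied from the printed output, and the column names "field" and "value" are made up for illustration.

```python
import pandas as pd

# Copied from the parser output shown above
rows = [
    ['', ''],
    ['Country', 'China'],
    ['Factory name', '01001523B'],
    ['IEM', 'BVCPS (HK), Shen Zhen Office'],
    ['Date of audit', 'May 20-22, 2003'],
    ['PC(s)', 'adidas-Salomon'],
    ['Number of workers', '243'],
    ['Product(s)', 'Scarf, cap, gloves, beanies and headbands'],
]

# The first row is an empty header, so split it off from the data rows
header, *body = rows

# A plain dict is enough for key/value tables like this one
profile = dict(body)

# The same data as a DataFrame with hypothetical column names
df = pd.DataFrame(body, columns=["field", "value"])
print(df)
```

Note that every value stays a string, so numeric fields such as the number of workers still need an explicit conversion.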

To install this library we can do:

pip install html-table-parser-python3 

There are two differences to Pandas:

3. Python Libraries for extraction from PDF files

Finally let’s find a list of useful Python libraries which can help in PDF parsing and extraction:

3.1 Python PDF parsing

3.2 Parse HTML tables

  • html-table-parser-python3 — parse HTML tables with Python 3 to list of values
  • tablextract — extracts the information represented in any HTML table
  • pdftotree — convert PDF into hOCR with text, tables, and figures being recognized and preserved.
  • pandas.read_html
  • html-table-extractor — A python library for extracting data from html table
  • py-html-table — Python library to extract data from HTML Tables with rowspan

3.3 Example PDF files

Finally you can find example PDF files where you can test table extraction with Python and Pandas:
