OCR library in Python

pytesseract 0.3.10

Python-tesseract is an optical character recognition (OCR) tool for Python. That is, it will recognize and “read” the text embedded in images.

Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.
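A minimal usage sketch (assuming pytesseract and Pillow are installed; test.png is a placeholder image file):

from PIL import Image
import pytesseract

# Print the text Tesseract recognizes in the image
print(pytesseract.image_to_string(Image.open('test.png')))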

USAGE

Note: Test images are located in the tests/data folder of the Git repo.

Support for OpenCV image/NumPy array objects
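For example, a hedged sketch of OCR on an OpenCV image (assumes opencv-python is installed; digits.png is a placeholder filename). OpenCV loads images as BGR NumPy arrays, so converting to RGB first avoids channel-order issues:

import cv2
import pytesseract

# OpenCV reads images as BGR; convert to RGB before passing to pytesseract
img_bgr = cv2.imread('digits.png')
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
print(pytesseract.image_to_string(img_rgb))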
If you need custom configuration such as oem or psm options, use the config keyword, for example as sketched below.
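A small sketch of passing extra flags via config (the flag values here are illustrative, not required):

import pytesseract

# --oem selects the OCR engine mode, --psm the page segmentation mode
custom_config = r'--oem 3 --psm 6'
print(pytesseract.image_to_string('test.png', config=custom_config))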
If you get a tessdata error like “Error opening data file…”, point Tesseract at your tessdata directory via the config keyword, as sketched below.
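A sketch of such a config; the tessdata path below is an assumption and should be replaced with the directory that actually contains your *.traineddata files:

import pytesseract

# Tell Tesseract where its language data lives
tessdata_dir_config = r'--tessdata-dir "/usr/share/tesseract-ocr/4.00/tessdata"'
print(pytesseract.image_to_string('test.png', lang='eng', config=tessdata_dir_config))

pytesseract exposes the following functions: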
  • get_languages Returns all currently supported languages by Tesseract OCR.
  • get_tesseract_version Returns the Tesseract version installed in the system.
  • image_to_string Returns unmodified output as string from Tesseract OCR processing.
  • image_to_boxes Returns result containing recognized characters and their box boundaries.
  • image_to_data Returns result containing box boundaries, confidences, and other information. Requires Tesseract 3.05+. For more information, please check the Tesseract TSV documentation.
  • image_to_osd Returns result containing information about orientation and script detection.
  • image_to_alto_xml Returns result in the form of Tesseract’s ALTO XML format.
  • run_and_get_output Returns the raw output from Tesseract OCR. Gives a bit more control over the parameters that are sent to tesseract.
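A brief sketch exercising a few of the helpers listed above (test.png is again a placeholder):

import pytesseract
from pytesseract import Output

print(pytesseract.get_tesseract_version())   # installed Tesseract version
print(pytesseract.get_languages(config=''))  # available language packs

# Character-level boxes and word-level data for the same image
print(pytesseract.image_to_boxes('test.png'))
data = pytesseract.image_to_data('test.png', output_type=Output.DICT)
print(data.keys())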

image_to_data(image, lang=None, config='', nice=0, output_type=Output.STRING, timeout=0, pandas_config=None)

  • image Object or String — PIL Image/NumPy array or file path of the image to be processed by Tesseract. If you pass an object instead of a file path, pytesseract will implicitly convert the image to RGB mode.
  • lang String — Tesseract language code string. Defaults to eng if not specified! Example for multiple languages: lang='eng+fra'
  • config String — Any additional custom configuration flags that are not available via the pytesseract function. For example: config='--psm 6'
  • nice Integer — modifies the processor priority for the Tesseract run. Not supported on Windows. Nice adjusts the niceness of unix-like processes.
  • output_type Class attribute — specifies the type of the output, defaults to string. For the full list of all supported types, please check the definition of the pytesseract.Output class.
  • timeout Integer or Float — duration in seconds for the OCR processing, after which pytesseract will terminate and raise RuntimeError.
  • pandas_config Dict — only for the Output.DATAFRAME type. Dictionary with custom arguments for pandas.read_csv. Allows you to customize the output of image_to_data.
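A sketch of a typical image_to_data call with dictionary output; the confidence filtering is illustrative, not part of the API:

import pytesseract
from pytesseract import Output

data = pytesseract.image_to_data('test.png', lang='eng', output_type=Output.DICT)

# Print every recognized word together with its confidence value
for word, conf in zip(data['text'], data['conf']):
    if word.strip():
        print(conf, word)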
Command-line usage: pytesseract [-l lang] image_file

INSTALLATION

  • Python-tesseract requires Python 3.6+
  • You will need the Python Imaging Library (PIL) (or the Pillow fork). Under Debian/Ubuntu, this is the package python-imaging or python3-imaging.
  • Install Google Tesseract OCR (see the Tesseract documentation for how to install the engine on Linux, macOS and Windows). You must be able to invoke the tesseract command as tesseract. If this isn't the case, for example because tesseract isn't in your PATH, you will have to change the tesseract_cmd variable pytesseract.pytesseract.tesseract_cmd (see the snippet after this list). Under Debian/Ubuntu you can use the package tesseract-ocr. For macOS users, please install the Homebrew package tesseract. Note: in some rare cases you might need to additionally install tessconfigs and configs from tesseract-ocr/tessconfigs if the OS-specific package doesn't include them.
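A minimal sketch of overriding the binary location; the Windows path below is only an example of where an installer might place tesseract.exe:

import pytesseract

# Point pytesseract at the tesseract binary if it is not on your PATH
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
print(pytesseract.image_to_string('test.png'))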

Check the pytesseract package page for more information.

pip install -U git+https://github.com/madmaze/pytesseract.git
git clone https://github.com/madmaze/pytesseract.git
cd pytesseract
pip install -U .
conda install -c conda-forge pytesseract

TESTING

To run this project’s test suite, install and run tox. Ensure that you have tesseract installed and in your PATH.

LICENSE

Check the LICENSE file included in the Python-tesseract repository/distribution. As of Python-tesseract 0.3.1, the license is Apache License Version 2.0.

CONTRIBUTORS

  • Originally written by Samuel Hoffstaetter
  • Juarez Bochi
  • Matthias Lee
  • Lars Kistner
  • Ryan Mitchell
  • Emilio Cecchini
  • John Hagen
  • Darius Morawiec
  • Eddie Bedada
  • Uğurcan Akyüz


NanoNets/ocr-python

OCR library to extract text & tables from PDF files and images. Convert any image or PDF to CSV / TXT / JSON / Searchable PDF.

This Python package is an OCR library which reads all text & tables from image & PDF files using an OCR engine & provides intelligent post-processing options to save OCR results in the formats you want.

The package requires Python 3 to run.

pip install ocr-nanonets-wrapper

This software is perpetually free 🙂

You can get your free API key (with unlimited requests) by creating a free account on https://app.nanonets.com/#/keys.

from nanonets import NANONETSOCR
model = NANONETSOCR()
model.set_token('REPLACE_API_KEY')

You can refer to the code shared below or use the code directly from here.

# Initialise
from nanonets import NANONETSOCR
model = NANONETSOCR()

# Authenticate
# This software is perpetually free :)
# You can get your free API key (with unlimited requests) by creating a free account on https://app.nanonets.com/#/keys?utm_source=wrapper.
model.set_token('REPLACE_API_KEY')

# PDF / Image to Raw OCR Engine Output
import json
pred_json = model.convert_to_prediction('INPUT_FILE')
print(json.dumps(pred_json, indent=2))

# PDF / Image to String
string = model.convert_to_string('INPUT_FILE')
print(string)

# PDF / Image to TXT File
model.convert_to_txt('INPUT_FILE', output_file_name = 'OUTPUTNAME.txt')

# PDF / Image to Boxes
# each element contains predicted word and bounding box information
# bounding box information denotes the spatial position of each word in the file
boxes = model.convert_to_boxes('test.png')
for box in boxes:
    print(box)

# PDF / Image to CSV
# This method extracts tables from your file and prints them in a .csv file.
# NOTE : This particular function is a trial offering 1000 pages of use.
# To use this at scale, please create your own model at app.nanonets.com --> New Model --> Tables.
model.convert_to_csv('INPUT_FILE', output_file_name = 'OUTPUTNAME.csv')

# PDF / Image to Tables
# This method extracts tables from your file and returns a json object.
# NOTE : This particular function is a trial offering 1000 pages of use.
# To use this at scale, please create your own model at app.nanonets.com --> New Model --> Tables.
tables_json = model.convert_to_tables('INPUT_FILE')
print(json.dumps(tables_json, indent=2))

# PDF / Image to Searchable PDF
model.convert_to_searchable_pdf('INPUT_FILE', output_file_name = 'OUTPUTNAME.pdf')

To make getting started easier for you, the repository includes sample code along with sample input files.

  • Clone or download the repo and open the /tests folder.
  • all_tests.ipynb is a Python notebook containing tests for all methods in the package.
  • convert_to_.py files are Python files corresponding to each method in the package individually.

convert_to_string() and convert_to_txt() methods have two optional parameters —

  1. formatting =
  • `lines and spaces` : default, all formatting enabled
  • `none` : space separated text with formatting removed
  • `lines` : space separated text with lines separated with newline character
  • `pages` : list of page wise space separated text
  2. line_threshold =
  • `low` : default
  • `high` : adding line_threshold='high' as a parameter while calling the method can in some cases improve reading of flowcharts and diagrams (see the sketch after this list).
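A short sketch of passing these parameters (INPUT_FILE and OUTPUTNAME.txt are placeholders, as in the examples above):

from nanonets import NANONETSOCR

model = NANONETSOCR()
model.set_token('REPLACE_API_KEY')

# Plain text with all formatting removed
text = model.convert_to_string('INPUT_FILE', formatting='none')
print(text)

# Line-separated text written to a file, with a higher line threshold
# (can help with flowcharts and diagrams)
model.convert_to_txt('INPUT_FILE', output_file_name='OUTPUTNAME.txt',
                     formatting='lines', line_threshold='high')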

If extracting flat fields, tables and line items from PDFs and images is your use case, we strongly advise you to create your own model by signing up on app.nanonets.com and using our advanced API. This will improve functionality, accuracy and response times significantly. Once you have created your account and model, you can use the API documentation present here to extract flat fields, tables and line items from any PDF or image.

We help businesses automate manual data entry using AI and reduce the turnaround times & manual effort required. More than 1000 enterprises use Nanonets for Intelligent Document Processing. We have generated incredible ROIs for our clients.

We provide OCR and IDP solutions customised for various use cases — invoice automation, Receipt OCR, purchase order automation, accounts payable automation, ID Card OCR and many more.

This software is perpetually free 🙂
