PDF to Word with Python

Open-source Python library for converting PDF to DOCX.

dothinking/pdf2docx

  • Extract data from PDF with PyMuPDF, e.g. text, images and drawings
  • Parse layout with rules, e.g. sections, paragraphs, images and tables
  • Generate docx with python-docx

  • Parse and re-create page layout
    • page margin
    • section and column (1 or 2 columns only)
    • page header and footer [TODO]
  • Parse and re-create paragraph
    • OCR text [TODO]
    • text in horizontal/vertical direction: from left to right, from bottom to top
    • font style, e.g. font name, size, weight, italic and color
    • text format, e.g. highlight, underline, strike-through
    • list style [TODO]
    • external hyperlink
    • paragraph horizontal alignment (left/right/center/justify) and vertical spacing
  • Parse and re-create image
    • in-line image
    • image in Gray/RGB/CMYK mode
    • transparent image
    • floating image, i.e. picture behind text
  • Parse and re-create table
    • border style, e.g. width, color
    • shading style, i.e. background color
    • merged cells
    • vertical direction cell
    • table with partly hidden borders
    • nested tables

    It can also be used as a tool to extract table contents, since both table content and format/style are parsed.
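
A minimal sketch of how the library described above might be driven from Python, covering both full conversion and table extraction. The `Converter` class with its `convert`, `extract_tables` and `close` methods comes from pdf2docx; the helper names `docx_path_for`, `convert_pdf` and `extract_tables`, and any file names passed to them, are assumptions made for illustration.

```python
from pathlib import Path


def docx_path_for(pdf_path: str) -> str:
    """Return the output path: same folder and stem as the PDF, with a .docx suffix."""
    return str(Path(pdf_path).with_suffix(".docx"))


def convert_pdf(pdf_path: str, first_page: int = 0, last_page=None) -> str:
    """Convert pages [first_page, last_page) of a PDF to a .docx file."""
    # Imported lazily so the path helper above works without the package installed.
    from pdf2docx import Converter  # third-party: pip install pdf2docx

    out_path = docx_path_for(pdf_path)
    cv = Converter(pdf_path)
    try:
        # Page indexes are zero-based; end=None means "up to the last page".
        cv.convert(out_path, start=first_page, end=last_page)
    finally:
        cv.close()
    return out_path


def extract_tables(pdf_path: str, first_page: int = 0, last_page=None) -> list:
    """Return the tables found on the given pages as nested lists of cell text."""
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        return cv.extract_tables(start=first_page, end=last_page)
    finally:
        cv.close()
```

Calling `convert_pdf("filename.pdf")` would write `filename.docx` next to the source file, and `extract_tables("filename.pdf", 0, 1)` would return the tables of the first page only.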

    Limitations:

    • Text-based PDF files only
    • Left-to-right languages
    • Normal reading direction, no word transformation or rotation
    • The rule-based method cannot reproduce every PDF layout exactly
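
Because pdf2docx expects a text-based PDF, it can help to check for a text layer before converting. The sketch below uses PyMuPDF (`fitz.open` and `page.get_text()`), which the library already depends on. The function names and the 20-character threshold are arbitrary assumptions, not part of either library.

```python
def has_text_layer(page_texts, min_chars: int = 20) -> bool:
    """True if at least one page yields min_chars or more non-whitespace characters."""
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def pdf_page_texts(pdf_path: str) -> list:
    """Extract the plain text of each page with PyMuPDF."""
    # Imported lazily so has_text_layer can be used without PyMuPDF installed.
    import fitz  # PyMuPDF, third-party: pip install pymupdf

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

A scanned PDF typically yields empty strings for every page, so `has_text_layer(pdf_page_texts("filename.pdf"))` returning `False` suggests the file needs OCR first.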

    How to Convert PDF to Word With Python: A Step-by-Step Guide

    PDF-to-Word conversion is usually handled by standalone converter programs, but it can also be done without them. This article looks at how to convert PDF to Word with the Python programming language.

    Specifically, it covers two third-party Python packages that make the conversion quick. Python enthusiasts will feel at home, but other users can follow along too.

    Each method comes with a step-by-step guide that you can follow regardless of how familiar you are with Python.

    The one prerequisite is a working Python installation: download the installer from https://www.python.org/downloads/ and install it like any other software. With that done, here are the guides on how to convert PDF to Word with Python.

    Method #1). Convert PDF Files to Word Using PyPDF2 Python Library

    PyPDF2 is a third-party Python package for extracting data from PDF files. It suits the case where a PDF contains text that you need to get out programmatically. Note that this approach extracts text only, so images and other rich media are not carried over.

    Extracting the text into a Word file saves the time and effort of retyping the content. The steps below show how to do it with PyPDF2.

    Step 1: Create a folder and place the PDF file in it. Pick any location and give the folder a relevant name so that the files are easy to find.

    Step 2: Install the PyPDF2 package. Open a command window in the folder you created. A convenient way is to type “cmd” in the File Explorer address bar and press “Enter”. Then type the command below, press “Enter”, and wait for the installation to complete.

    pip install PyPDF2

    Step 3: Create a Python script to extract data from the PDF. In your preferred text editor, create a new file with the “.py” extension and paste the code below into it. Replace “filename” with the name of the target PDF file, otherwise the script fails with a file-not-found error. Save the file, for example as “pypdf2.py”.

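    import PyPDF2  # third-party: pip install PyPDF2

    FILE_PATH = 'filename.pdf'

    with open(FILE_PATH, mode='rb') as f:
        # Print the text of every page so that it can be redirected to a file.
        print('\n'.join(page.extract_text() or '' for page in PyPDF2.PdfReader(f).pages))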

    Step 4: Run the script to extract the text into a Word file. The script prints the extracted text, so redirecting its output saves that text to a file. Type the command below and press “Enter”.

    python pypdf2.py > filename.doc

    You can change “filename” to any name you prefer for the output document. Reusing the source filename is convenient, since the extension differs. After the command runs, the new file appears in the folder you created. Note that it contains plain text saved under a .doc extension, which Word opens as unformatted text.

    Step 5: View the Word document. Open the output .doc file to view the extracted text and edit it as needed.

    As you can see, working with PyPDF2 takes only a few lines of code that you can copy and adapt.
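
Redirecting printed text into a .doc file produces plain text under a Word extension. As an alternative sketch, the extracted text can be written into a real .docx file with python-docx (`Document`, `add_paragraph`, `add_page_break`, `save`). The function names are assumptions for illustration, and splitting paragraphs on blank lines is a simple heuristic, not something PyPDF2 guarantees.

```python
def split_paragraphs(text: str) -> list:
    """Split extracted text into paragraphs on blank lines, joining wrapped lines."""
    paragraphs = []
    for chunk in text.replace("\r\n", "\n").split("\n\n"):
        joined = " ".join(line.strip() for line in chunk.split("\n") if line.strip())
        if joined:
            paragraphs.append(joined)
    return paragraphs


def pdf_text_to_docx(pdf_path: str, docx_path: str) -> int:
    """Write the text of every page into a .docx file and return the page count."""
    # Imported lazily so split_paragraphs works without the packages installed.
    from PyPDF2 import PdfReader  # third-party: pip install PyPDF2
    from docx import Document     # third-party: pip install python-docx

    reader = PdfReader(pdf_path)
    document = Document()
    for index, page in enumerate(reader.pages):
        if index > 0:
            # Start each PDF page on a new Word page.
            document.add_page_break()
        for paragraph in split_paragraphs(page.extract_text() or ""):
            document.add_paragraph(paragraph)
    document.save(docx_path)
    return len(reader.pages)
```

Calling `pdf_text_to_docx("filename.pdf", "filename.docx")` would produce a document that Word opens natively, though still without the fonts, images, or tables of the original.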

    Method #2). Convert PDFs to Word Using GroupDocs Python SDK

    GroupDocs provides a Cloud SDK for Python that converts PDF to Word through an online service. Unlike PyPDF2, it can process a richly formatted PDF and retain the original formatting in the converted Word file. Apart from the SDK package itself, no other tools or software are needed.

    Converting a PDF to Word with its formatting intact is not trivial, but this package does the heavy lifting. The steps below walk through the process, even if you are a novice.

    Step 1: Get your APP SID and APP KEY. Sign up for free at https://dashboard.groupdocs.cloud. Afterwards, you will find the APP SID and APP KEY in the “My Apps” tab under the “Manage My Apps” sub-tab. The script needs both values, so keep them at hand.
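
Since the APP SID and APP KEY are credentials, one option is to keep them out of the script and read them from environment variables. This is only a sketch: the variable names `GROUPDOCS_APP_SID` and `GROUPDOCS_APP_KEY` and the function `load_credentials` are hypothetical choices, not names defined by GroupDocs.

```python
import os


def load_credentials(environ=os.environ) -> tuple:
    """Return (app_sid, app_key) from the environment, or raise if either is missing."""
    # The variable names below are illustrative assumptions.
    app_sid = environ.get("GROUPDOCS_APP_SID", "").strip()
    app_key = environ.get("GROUPDOCS_APP_KEY", "").strip()
    if not app_sid or not app_key:
        raise RuntimeError("Set GROUPDOCS_APP_SID and GROUPDOCS_APP_KEY first.")
    return app_sid, app_key
```

On Windows, the values could be set in the command window with `set GROUPDOCS_APP_SID=...` before the script is run.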

    Step 2: Create a directory and place the PDF file in it. For convenience, create a fresh directory with a name of your choice and put the target PDF document inside.

    Step 3: Install the GroupDocs package. The “groupdocs-conversion-cloud” Python package is installed from the command line. Type “cmd” in the File Explorer address bar and press “Enter”. In the command window that opens, type the command below and press “Enter”.

    pip install groupdocs-conversion-cloud

    The installation should take only a few moments.

    Step 4: Create the required Python script. In the same directory as the PDF file, create a new “.py” script with the code below, which performs the conversion. Replace the “app_sid” and “app_key” values with the ones assigned to you when you signed up, and replace “filename” with the name of your PDF file. For convenience, save the script as “groupdocs.py”.

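    # Get your app_sid and app_key at https://dashboard.groupdocs.cloud (free registration is required).
    import groupdocs_conversion_cloud

    app_sid = 'app_sid'  # replace with your APP SID
    app_key = 'app_key'  # replace with your APP KEY

    # Create instances of the API
    convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(app_sid, app_key)
    file_api = groupdocs_conversion_cloud.FileApi.from_keys(app_sid, app_key)

    try:
        # Upload the source file to storage
        filename = 'filename.pdf'
        remote_name = 'filename.pdf'
        output_name = 'filename.docx'
        request_upload = groupdocs_conversion_cloud.UploadFileRequest(remote_name, filename)
        file_api.upload_file(request_upload)

        # Convert the PDF to a Word document
        settings = groupdocs_conversion_cloud.ConvertSettings()
        settings.file_path = remote_name
        settings.format = 'docx'
        settings.output_path = output_name
        request = groupdocs_conversion_cloud.ConvertDocumentRequest(settings)
        response = convert_api.convert_document(request)
        print("Document converted successfully: " + str(response))
    except groupdocs_conversion_cloud.ApiException as e:
        print("Exception while converting the document: {0}".format(e.message))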

    Double-check that you have replaced the placeholder values before saving the Python script. In summary, the script imports the required Python package, initializes the API, uploads the source PDF file, converts the PDF to Word, and prints information about the output.

    Step 5: Run the Python script. In the command window opened earlier, type the command below, press “Enter”, and wait for it to complete.

    python groupdocs.py

    If the conversion succeeded, the output shows a success message together with the file path, size, and URL.

    Step 6: Download the converted Word document. In your web browser, go to https://dashboard.groupdocs.cloud/ and open the “My Files” tab. This opens your “Storage” and lists the available files, including the uploaded PDF file and the converted Word file. Tick the box to the left of the DOCX file and click the “Download” button. When asked to confirm the download of the checked files, click “Yes”.

    The download starts shortly, and the Word document is saved to your default downloads folder. From there, you can open the file and edit it as needed. This completes the conversion with the GroupDocs Python SDK.
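
The download can probably also be scripted, which avoids the dashboard. The sketch below assumes that the SDK exposes `DownloadFileRequest` and `FileApi.download_file`, and that `download_file` returns the path of a temporary local file, so check both against the SDK documentation before relying on them. The helper names `local_target` and `download_converted` are hypothetical.

```python
import shutil
from pathlib import Path


def local_target(remote_name: str, folder: str = ".") -> Path:
    """Build the local path for a file downloaded from cloud storage."""
    # Keep only the final path component so remote folders are not recreated locally.
    return Path(folder) / Path(remote_name.replace("\\", "/")).name


def download_converted(app_sid: str, app_key: str, remote_name: str, folder: str = ".") -> Path:
    """Download remote_name from GroupDocs storage into folder."""
    # Imported lazily so local_target works without the SDK installed.
    import groupdocs_conversion_cloud  # third-party: pip install groupdocs-conversion-cloud

    file_api = groupdocs_conversion_cloud.FileApi.from_keys(app_sid, app_key)
    request = groupdocs_conversion_cloud.DownloadFileRequest(remote_name)
    temp_path = file_api.download_file(request)  # assumed to return a temporary file path
    target = local_target(remote_name, folder)
    shutil.move(str(temp_path), str(target))
    return target
```

With the credentials from Step 1, `download_converted(app_sid, app_key, "filename.docx")` would place the converted file in the current folder.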

    With a free account, GroupDocs offers a reliable way to extract data from PDF files while retaining much of the original layout and formatting. This reduces the need for corrections after the conversion.

    With the libraries featured in this article, you can use Python whenever you need to convert PDF to Word. The step-by-step guides keep the learning curve gentle.

    So the next time you need to save a PDF as Word, there is no need to look for full-fledged software when Python can do the job. Pick the tool that fits your needs best and follow its guide to get the result you want.
