Viewing a doc file in PHP

Extract text from doc and docx

I would like to know how I can read the contents of a doc or docx file. I'm using a Linux VPS and PHP, but if there is a simpler solution in another language, please let me know, as long as it works on a Linux web server.

9 Answers

Here is a solution for getting the text out of .doc and .docx Word files.

How to extract text from Word files (.doc, .docx) in PHP

For .doc

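    private function read_doc()
    {
        // Read the whole binary file into memory.
        $fileHandle = fopen($this->filename, "rb");
        if ($fileHandle === false) {
            return false;
        }
        $line = @fread($fileHandle, filesize($this->filename));
        fclose($fileHandle);

        // Split on carriage returns (0x0D), which end paragraphs in .doc text.
        $lines = explode(chr(0x0D), $line);
        $outtext = "";
        foreach ($lines as $thisline) {
            // Skip empty chunks and chunks containing NUL bytes (binary data).
            $pos = strpos($thisline, chr(0x00));
            if ($pos === false && strlen($thisline) > 0) {
                $outtext .= $thisline . " ";
            }
        }

        // Keep only basic ASCII letters, digits, whitespace, and punctuation.
        $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
        return $outtext;
    }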

For .docx

    private function read_docx()
    {
        $striped_content = '';
        $content = '';

        $zip = zip_open($this->filename);
        if (!$zip || is_numeric($zip)) {
            return false;
        }

        while ($zip_entry = zip_read($zip)) {
            if (zip_entry_open($zip, $zip_entry) == false) {
                continue;
            }
            if (zip_entry_name($zip_entry) != "word/document.xml") {
                continue;
            }
            $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
            zip_entry_close($zip_entry);
        } // end while
        zip_close($zip);

        // Table cell boundaries become spaces, paragraph ends become line breaks.
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);
        $striped_content = strip_tags($content);

        return $striped_content;
    }

Thank you, .doc files are working fine, but .docx files are not. I used the code above. The MIME type of my .docx file shows as 'application/msword'. Am I missing anything?
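When the reported MIME type cannot be trusted, as in the comment above, one option is to look at the first bytes of the file instead. The sketch below is only an illustration: the function name `detectWordFormat` is hypothetical and not part of the answer. It relies on two well-known signatures: .docx files are ZIP containers that start with `PK\x03\x04`, and legacy .doc files are OLE2 compound files that start with `D0 CF 11 E0 A1 B1 1A E1`. A ZIP signature alone does not prove the archive is a Word document, so treat the result as a hint.

```php
<?php
/**
 * Guess the Word format from the leading bytes of a file.
 * Hypothetical helper: returns 'docx' for a ZIP container,
 * 'doc' for an OLE2 compound file, and null for anything else.
 */
function detectWordFormat(string $path): ?string
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }
    $magic = fread($handle, 8);
    fclose($handle);
    if ($magic === false) {
        return null;
    }

    // ZIP local file header signature (.docx, but also any other ZIP file).
    if (strncmp($magic, "PK\x03\x04", 4) === 0) {
        return 'docx';
    }
    // OLE2 compound document signature (legacy .doc, .xls, .ppt).
    if ($magic === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'doc';
    }
    return null;
}
```

The result could then be used to choose between read_doc() and read_docx() instead of relying on the file extension or the uploaded MIME type.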

This is a .DOCX-only solution. For .DOC or .PDF you'll need to use something else, such as pdf2text.php for PDF.

    function docx2text($filename)
    {
        return readZippedXML($filename, "word/document.xml");
    }

    function readZippedXML($archiveFile, $dataFile)
    {
        // Create new ZIP archive
        $zip = new ZipArchive;

        // Open received archive file
        if (true === $zip->open($archiveFile)) {
            // If done, search for the data file in the archive
            if (($index = $zip->locateName($dataFile)) !== false) {
                // If found, read it to the string
                $data = $zip->getFromIndex($index);
                // Close archive file
                $zip->close();
                // Load XML from a string
                // Skip errors and warnings
                $xml = new DOMDocument();
                $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
                // Return data without XML formatting tags
                return strip_tags($xml->saveXML());
            }
            $zip->close();
        }

        // In case of failure return empty string
        return "";
    }

    echo docx2text("test.docx"); // Save this contents to file


A pure PHP library for reading and writing word processing documents


PHPOffice/PHPWord


PHPWord is a library written in pure PHP that provides a set of classes to write to and read from different document file formats. The current version of PHPWord supports Microsoft Office Open XML (OOXML or OpenXML), OASIS Open Document Format for Office Applications (OpenDocument or ODF), Rich Text Format (RTF), HTML, and PDF.

PHPWord is an open source project licensed under the terms of LGPL version 3. PHPWord aims to be a high quality software product by incorporating continuous integration and unit testing. You can learn more about PHPWord by reading the Developers' Documentation.

If you have any questions, please ask on Stack Overflow.

With PHPWord, you can create OOXML, ODF, or RTF documents dynamically using your PHP scripts. Below are some of the things that you can do with PHPWord library:

  • Set document properties, e.g. title, subject, and creator.
  • Create document sections with different settings, e.g. portrait/landscape, page size, and page numbering
  • Create header and footer for each section
  • Set default font type, font size, and paragraph style
  • Use UTF-8 and East Asia fonts/characters
  • Define custom font styles (e.g. bold, italic, color) and paragraph styles (e.g. centered, multicolumns, spacing) either as named style or inline in text
  • Insert paragraphs, either as a simple text or complex one (a text run) that contains other elements
  • Insert titles (headers) and table of contents
  • Insert text breaks and page breaks
  • Insert and format images, either local, remote, or as page watermarks
  • Insert binary OLE Objects such as Excel or Visio
  • Insert and format tables with customized properties for each row (e.g. repeat as header row) and cell (e.g. background color, rowspan, colspan)
  • Insert list items as bulleted, numbered, or multilevel
  • Insert hyperlinks
  • Insert footnotes and endnotes
  • Insert drawing shapes (arc, curve, line, polyline, rect, oval)
  • Insert charts (pie, doughnut, bar, line, area, scatter, radar)
  • Insert form fields (textinput, checkbox, and dropdown)
  • Create document from templates
  • Use XSL 1.0 style sheets to transform headers, main document part, and footers of an OOXML template
  • ... and many more features in progress

PHPWord requires the following:

  • PHP 7.1+
  • XML Parser extension
  • Laminas Escaper component
  • Zip extension (optional, used to write OOXML and ODF)
  • GD extension (optional, used to add images)
  • XMLWriter extension (optional, used to write OOXML and ODF)
  • XSL extension (optional, used to apply XSL style sheet to template)
  • dompdf library (optional, used to write PDF)

PHPWord is installed via Composer. To add PHPWord as a dependency of your project, either

Run the following to use the latest stable version

composer require phpoffice/phpword

or if you want the latest unreleased version

composer require phpoffice/phpword:dev-master

The following is a basic usage example of the PHPWord library.

    <?php
    require_once 'bootstrap.php';

    // Creating the new document.
    $phpWord = new \PhpOffice\PhpWord\PhpWord();

    /* Note: any element you append to a document must reside inside of a Section. */

    // Adding an empty Section to the document.
    $section = $phpWord->addSection();

    // Adding Text element to the Section having font styled by default.
    $section->addText(
        '"Learn from yesterday, live for today, hope for tomorrow. '
            . 'The important thing is not to stop questioning." '
            . '(Albert Einstein)'
    );

    /*
     * Note: it's possible to customize font style of the Text element you add in three ways:
     * - inline;
     * - using named font style (new font style object will be implicitly created);
     * - using explicitly created font style object.
     */

    // Adding Text element with font customized inline.
    $section->addText(
        '"Great achievement is usually born of great sacrifice, '
            . 'and is never the result of selfishness." '
            . '(Napoleon Hill)',
        array('name' => 'Tahoma', 'size' => 10)
    );

    // Adding Text element with font customized using named font style.
    $fontStyleName = 'oneUserDefinedStyle';
    $phpWord->addFontStyle(
        $fontStyleName,
        array('name' => 'Tahoma', 'size' => 10, 'color' => '1B2232', 'bold' => true)
    );
    $section->addText(
        '"The greatest accomplishment is not in never falling, '
            . 'but in rising again after you fall." '
            . '(Vince Lombardi)',
        $fontStyleName
    );

    // Adding Text element with font customized using explicitly created font style object.
    $fontStyle = new \PhpOffice\PhpWord\Style\Font();
    $fontStyle->setBold(true);
    $fontStyle->setName('Tahoma');
    $fontStyle->setSize(13);
    $myTextElement = $section->addText('"Believe you can and you\'re halfway there." (Theodor Roosevelt)');
    $myTextElement->setFontStyle($fontStyle);

    // Saving the document as OOXML file.
    $objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'Word2007');
    $objWriter->save('helloWorld.docx');

    // Saving the document as ODF file.
    $objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'ODText');
    $objWriter->save('helloWorld.odt');

    // Saving the document as HTML file.
    $objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'HTML');
    $objWriter->save('helloWorld.html');

    /* Note: we skip RTF, because it's not XML-based and requires a different example. */
    /* Note: we skip PDF, because "HTML-to-PDF" approach is used to create PDF documents. */

More examples are provided in the samples folder. For easy access to those samples, launch php -S localhost:8000 in the samples directory, then browse to http://localhost:8000 to view them. You can also read the Developers' Documentation for more detail.

We welcome everyone to contribute to PHPWord. Below are some of the things that you can do to contribute.

  • Read our contributing guide.
  • Fork us and request a pull to the master branch.
  • Submit bug reports or feature requests to GitHub.
  • Follow @PHPWord and @PHPOffice on Twitter.


read word document in php

strip_tags() removes all of the XML, including the elements that carry inline styles and classes; to restore the styling you would need to keep those elements and interpret or apply them in some way.
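To illustrate what strip_tags() throws away, the sketch below keeps one piece of formatting, bold, while collecting text. It is an illustration only: the function name `docxRuns` is hypothetical, it expects the already extracted contents of word/document.xml as a string, and it looks only at direct run formatting (a `w:b` element inside `w:rPr`), ignoring bold inherited from paragraph or character styles.

```php
<?php
/**
 * Hypothetical helper: split WordprocessingML into runs of text,
 * recording whether each run carries direct bold formatting.
 */
function docxRuns(string $documentXml): array
{
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    $dom = new DOMDocument();
    // LIBXML_NONET blocks network access while parsing untrusted input.
    if (!@$dom->loadXML($documentXml, LIBXML_NONET)) {
        return [];
    }

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', $ns);

    $runs = [];
    foreach ($xpath->query('//w:r') as $run) {
        // Concatenate the text nodes that belong directly to this run.
        $text = '';
        foreach ($xpath->query('w:t', $run) as $t) {
            $text .= $t->textContent;
        }
        if ($text === '') {
            continue;
        }

        // <w:b/> means bold unless its w:val attribute switches it off.
        $bold = false;
        $b = $xpath->query('w:rPr/w:b', $run)->item(0);
        if ($b instanceof DOMElement) {
            $val = $b->getAttributeNS($ns, 'val');
            $bold = !in_array($val, ['0', 'false', 'off'], true);
        }

        $runs[] = ['text' => $text, 'bold' => $bold];
    }
    return $runs;
}
```

Each returned entry could then be mapped to whatever output markup is needed, for example wrapping bold runs in `<strong>` when producing HTML.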

"PHPWord is a library written in pure PHP that provides a set of classes to write to and read from different document file formats." (PHPOffice, 2016)

This open source PHP library should solve your problem. You can either download it or install it with Composer.

The following is a similar function to the one in @Sudhir's answer, but for PHP 8:

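    function readDocx($filename)
    {
        $zip = new ZipArchive();
        // open() returns true on success and an integer error code on failure,
        // so compare strictly against true.
        if ($zip->open($filename) === true) {
            $content = $zip->getFromName("word/document.xml");
            $zip->close();
            if ($content === false) {
                return false;
            }
            // Table cell boundaries become spaces, paragraph ends become line breaks.
            $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
            $content = str_replace('</w:r></w:p>', "\r\n", $content);
            return strip_tags($content);
        }
        return false;
    }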

The procedural zip_* functions have been deprecated in PHP 8 and replaced by ZipArchive.

I am using PHP 7 and got a deprecation warning with @Sudhir's answer. I also tried PHPWord, and it didn't work with my Word files created by MS Word or Google Docs. This short code just worked for both. This should be marked as the answer, thank you.

"docx" is different from "doc". Docx files are basically XML files in a ZIP container (as described by Wikipedia). Doc files are binary blobs.

I am aware of no library that can easily read docx files in PHP (although Phpdocx can write them). However, since these are just ZIP files and XML files, you should be able to put something together using ZipArchive to open the docx container and DOMDocument, SimpleXML, XMLReader, or XSLTProcessor to read the XML documents themselves.
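A minimal sketch of that ZipArchive plus DOMDocument approach might look like the following. The function name `docxParagraphs` is hypothetical, and the code assumes the main text lives in word/document.xml and uses the standard WordprocessingML namespace. It collects the `w:t` text nodes of every `w:p` paragraph, so headers, footers, footnotes, and other parts stored elsewhere in the archive are not covered, and paragraphs nested inside text boxes may be counted twice.

```php
<?php
/**
 * Hypothetical helper: return the text of each paragraph in a .docx file.
 * Returns an empty array if the file cannot be opened or parsed.
 */
function docxParagraphs(string $path): array
{
    $zip = new ZipArchive();
    // open() returns true on success, an integer error code otherwise.
    if ($zip->open($path) !== true) {
        return [];
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return [];
    }

    $dom = new DOMDocument();
    // LIBXML_NONET blocks network access while parsing untrusted input.
    if (!@$dom->loadXML($xml, LIBXML_NONET)) {
        return [];
    }

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace(
        'w',
        'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    );

    $paragraphs = [];
    foreach ($xpath->query('//w:p') as $paragraph) {
        // A paragraph's text is split across runs; join the w:t nodes.
        $text = '';
        foreach ($xpath->query('.//w:t', $paragraph) as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return $paragraphs;
}
```

Joining the result with `implode("\n", docxParagraphs('test.docx'))` gives plain text comparable to the strip_tags() variants, but without depending on the exact order of closing tags in the XML.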

A Word document isn't stored conveniently like a text file (it's more like an XML or binary file), so you can't just use echo and expect it to output the human-readable portion of the docx file.

There's a library that could do what you want, but it only accepts .doc files.
