- Extract text from doc and docx
- 9 Answers
- How to extract text from word file .doc,docx php
- For .doc
- For .docx
- PHPOffice/PHPWord
- read word document in php
Extract text from doc and docx
I would like to know how I can read the contents of a .doc or .docx file. I’m using a Linux VPS and PHP, but if there is a simpler solution in another language, please let me know, as long as it works on a Linux web server.
9 Answers
Here is a solution for extracting text from .doc and .docx Word files.
How to extract text from word file .doc,docx php
For .doc
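    private function read_doc() {
        // Read the whole binary .doc file into memory.
        $fileHandle = fopen($this->filename, "r");
        $line = @fread($fileHandle, filesize($this->filename));
        fclose($fileHandle);

        // Split on carriage returns; chunks containing NUL bytes are
        // treated as binary data and skipped.
        $lines = explode(chr(0x0D), $line);
        $outtext = "";
        foreach ($lines as $thisline) {
            $pos = strpos($thisline, chr(0x00));
            if (($pos !== FALSE) || (strlen($thisline) == 0)) {
                // Binary or empty chunk: ignore it.
            } else {
                $outtext .= $thisline . " ";
            }
        }

        // Keep only basic printable characters.
        $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
        return $outtext;
    }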
For .docx
    private function read_docx() {
        $striped_content = '';
        $content = '';

        $zip = zip_open($this->filename);
        if (!$zip || is_numeric($zip)) return false;

        // Walk the archive and read only the main document part.
        while ($zip_entry = zip_read($zip)) {
            if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
            if (zip_entry_name($zip_entry) != "word/document.xml") continue;

            $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
            zip_entry_close($zip_entry);
        } // end while
        zip_close($zip);

        // Turn table-cell and paragraph boundaries into whitespace
        // before stripping the remaining XML tags.
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);
        $striped_content = strip_tags($content);

        return $striped_content;
    }
Thank you, .doc files are working fine, but .docx files are not working with the code above. The MIME type of my .docx file is reported as ‘application/msword’. Am I missing anything?
This is a .DOCX solution only. For .DOC or .PDF you’ll need to use something else like pdf2text.php for PDF
    function docx2text($filename) {
        return readZippedXML($filename, "word/document.xml");
    }

    function readZippedXML($archiveFile, $dataFile) {
        // Create new ZIP archive
        $zip = new ZipArchive;

        // Open received archive file
        if (true === $zip->open($archiveFile)) {
            // If done, search for the data file in the archive
            if (($index = $zip->locateName($dataFile)) !== false) {
                // If found, read it to the string
                $data = $zip->getFromIndex($index);
                // Close archive file
                $zip->close();
                // Load XML from a string, skipping errors and warnings
                $xml = new DOMDocument();
                $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
                // Return data without XML formatting tags
                return strip_tags($xml->saveXML());
            }
            $zip->close();
        }

        // In case of failure return empty string
        return "";
    }

    // Print the extracted text (or save it to a file instead)
    echo docx2text("test.docx");
A pure PHP library for reading and writing word processing documents
PHPOffice/PHPWord
PHPWord is a library written in pure PHP that provides a set of classes to write to and read from different document file formats. The current version of PHPWord supports Microsoft Office Open XML (OOXML or OpenXML), OASIS Open Document Format for Office Applications (OpenDocument or ODF), Rich Text Format (RTF), HTML, and PDF.
PHPWord is an open source project licensed under the terms of LGPL version 3. PHPWord is aimed to be a high quality software product by incorporating continuous integration and unit testing. You can learn more about PHPWord by reading the Developers’ Documentation.
If you have any questions, please ask on Stack Overflow.
With PHPWord, you can create OOXML, ODF, or RTF documents dynamically using your PHP scripts. Below are some of the things that you can do with PHPWord library:
- Set document properties, e.g. title, subject, and creator.
- Create document sections with different settings, e.g. portrait/landscape, page size, and page numbering
- Create header and footer for each section
- Set default font type, font size, and paragraph style
- Use UTF-8 and East Asia fonts/characters
- Define custom font styles (e.g. bold, italic, color) and paragraph styles (e.g. centered, multicolumns, spacing) either as named style or inline in text
- Insert paragraphs, either as a simple text or complex one (a text run) that contains other elements
- Insert titles (headers) and table of contents
- Insert text breaks and page breaks
- Insert and format images, either local, remote, or as page watermarks
- Insert binary OLE Objects such as Excel or Visio
- Insert and format tables with customized properties for each row (e.g. repeat as header row) and cell (e.g. background color, rowspan, colspan)
- Insert list items as bulleted, numbered, or multilevel
- Insert hyperlinks
- Insert footnotes and endnotes
- Insert drawing shapes (arc, curve, line, polyline, rect, oval)
- Insert charts (pie, doughnut, bar, line, area, scatter, radar)
- Insert form fields (textinput, checkbox, and dropdown)
- Create document from templates
- Use XSL 1.0 style sheets to transform headers, main document part, and footers of an OOXML template
- ... and many more features in progress
PHPWord requires the following:
- PHP 7.1+
- XML Parser extension
- Laminas Escaper component
- Zip extension (optional, used to write OOXML and ODF)
- GD extension (optional, used to add images)
- XMLWriter extension (optional, used to write OOXML and ODF)
- XSL extension (optional, used to apply XSL style sheet to template)
- dompdf library (optional, used to write PDF)
PHPWord is installed via Composer. To add a dependency to PHPWord in your project, either
Run the following to use the latest stable version
composer require phpoffice/phpword
or if you want the latest unreleased version
composer require phpoffice/phpword:dev-master
The following is a basic usage example of the PHPWord library.
    require_once 'bootstrap.php';

    // Creating the new document.
    $phpWord = new \PhpOffice\PhpWord\PhpWord();

    /* Note: any element you append to a document must reside inside of a Section. */

    // Adding an empty Section to the document.
    $section = $phpWord->addSection();

    // Adding Text element to the Section having font styled by default.
    $section->addText(
        '"Learn from yesterday, live for today, hope for tomorrow. '
            . 'The important thing is not to stop questioning." '
            . '(Albert Einstein)'
    );

    /*
     * Note: it's possible to customize font style of the Text element you add in three ways:
     * - inline;
     * - using named font style (new font style object will be implicitly created);
     * - using explicitly created font style object.
     */

    // Adding Text element with font customized inline.
    $section->addText(
        '"Great achievement is usually born of great sacrifice, '
            . 'and is never the result of selfishness." '
            . '(Napoleon Hill)',
        array('name' => 'Tahoma', 'size' => 10)
    );

    // Adding Text element with font customized using named font style.
    $fontStyleName = 'oneUserDefinedStyle';
    $phpWord->addFontStyle(
        $fontStyleName,
        array('name' => 'Tahoma', 'size' => 10, 'color' => '1B2232', 'bold' => true)
    );
    $section->addText(
        '"The greatest accomplishment is not in never falling, '
            . 'but in rising again after you fall." '
            . '(Vince Lombardi)',
        $fontStyleName
    );

    // Adding Text element with font customized using explicitly created font style object.
    $fontStyle = new \PhpOffice\PhpWord\Style\Font();
    $fontStyle->setBold(true);
    $fontStyle->setName('Tahoma');
    $fontStyle->setSize(13);
    $myTextElement = $section->addText('"Believe you can and you\'re halfway there." (Theodore Roosevelt)');
    $myTextElement->setFontStyle($fontStyle);

    // Saving the document as OOXML file.
    $objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'Word2007');
    $objWriter->save('helloWorld.docx');

    // Saving the document as ODF file.
    $objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'ODText');
    $objWriter->save('helloWorld.odt');

    // Saving the document as HTML file.
    $objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'HTML');
    $objWriter->save('helloWorld.html');

    /* Note: we skip RTF, because it's not XML-based and requires a different example. */
    /* Note: we skip PDF, because "HTML-to-PDF" approach is used to create PDF documents. */
More examples are provided in the samples folder. For an easy access to those samples launch php -S localhost:8000 in the samples directory then browse to http://localhost:8000 to view the samples. You can also read the Developers’ Documentation for more detail.
We welcome everyone to contribute to PHPWord. Below are some of the things that you can do to contribute.
- Read our contributing guide.
- Fork us and request a pull to the master branch.
- Submit bug reports or feature requests to GitHub.
- Follow @PHPWord and @PHPOffice on Twitter.
read word document in php
strip_tags() removes all of the XML, including the elements that carry inline styles and class references. To restore the styling you would need to keep those elements and interpret them yourself.
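To illustrate the point about styling, here is a minimal sketch that walks the runs in word/document.xml and records whether each one is marked bold, instead of discarding the markup. The function name docxRunsWithBold is hypothetical, and the sketch only looks at direct run formatting (w:rPr/w:b); it does not resolve bold inherited from paragraph or character styles.

```php
<?php
// Hypothetical helper: return each run's text plus a "bold" flag.
// Input is the raw XML of word/document.xml as a string.
function docxRunsWithBold(string $xml): array
{
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    $dom = new DOMDocument();
    // loadXML() rejects an empty string in PHP 8, so guard against it.
    if ($xml === '' || !$dom->loadXML($xml, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return [];
    }

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', $ns);

    $runs = [];
    foreach ($xpath->query('//w:r') as $run) {
        // Concatenate all text nodes of this run.
        $text = '';
        foreach ($xpath->query('w:t', $run) as $t) {
            $text .= $t->textContent;
        }

        // A run is bold when w:rPr/w:b is present and not switched off.
        $bold = false;
        $b = $xpath->query('w:rPr/w:b', $run)->item(0);
        if ($b instanceof DOMElement) {
            $val = $b->getAttributeNS($ns, 'val');
            $bold = !in_array($val, ['0', 'false', 'off'], true);
        }

        $runs[] = ['text' => $text, 'bold' => $bold];
    }
    return $runs;
}
```

The returned array could then be mapped to HTML (for example wrapping bold runs in a strong element); other properties such as italics or color would need the same kind of lookup.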
«PHPWord is a library written in pure PHP that provides a set of classes to write to and read from different document file formats.» (PHPOffice, 2016)
This open-source PHP library should solve your problem. You can either download it or install it with Composer.
The following is a similar function to the one in @Sudhir’s answer, but for PHP 8:
    function readDocx($filename) {
        $zip = new ZipArchive();
        // open() returns true on success and an integer error code on failure,
        // so compare strictly against true.
        if ($zip->open($filename) === true) {
            $content = $zip->getFromName("word/document.xml");
            $zip->close();
            // Turn table-cell and paragraph boundaries into whitespace
            // before stripping the remaining XML tags.
            $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
            $content = str_replace('</w:r></w:p>', "\r\n", $content);
            return strip_tags($content);
        }
        return false;
    }
The procedural zip_*() functions are deprecated as of PHP 8.0; use the ZipArchive class instead.
I am using PHP 7 and got the deprecation warning for @Sudhir’s answer. I also tried PHPWord, and it didn’t work with my Word files created by MS Word or Google Docs. This short code worked for both. This should be marked as the answer, thank you.
«docx» is different from «doc». Docx files are basically XML files in a ZIP container (as described by Wikipedia). Doc files are binary blobs.
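Because the two formats differ at the byte level, a file can be classified by its leading bytes rather than by extension or reported MIME type. The sketch below is a heuristic under stated assumptions: detectWordFormat is a hypothetical name, any ZIP archive (not only .docx) starts with the same signature, and the OLE2 signature is shared by other legacy Office formats.

```php
<?php
// Hypothetical helper: guess the Word format from the file's first bytes.
// Returns 'docx' for a ZIP container, 'doc' for an OLE2 compound file,
// and 'unknown' otherwise.
function detectWordFormat(string $path): string
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return 'unknown';
    }
    $header = (string) fread($handle, 8);
    fclose($handle);

    // ZIP local file header signature "PK\x03\x04".
    if (strncmp($header, "PK\x03\x04", 4) === 0) {
        return 'docx';
    }
    // OLE2 compound file signature used by legacy binary Office files.
    if ($header === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'doc';
    }
    return 'unknown';
}
```

A caller could use the result to choose between a binary .doc reader and a ZIP/XML .docx reader, for example `detectWordFormat('upload.bin') === 'docx'`, where the file name is only a placeholder.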
I am aware of no library that can easily read docx files in PHP (although Phpdocx can write them). However, since these are just ZIP files containing XML files, you should be able to put something together using ZipArchive to open the docx container and DOMDocument, SimpleXML, XMLReader, or XSLTProcessor to read the XML documents themselves.
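The approach described above could be sketched roughly as follows, using ZipArchive for the container and DOMDocument for the XML. The function names documentXmlToText and docxToText are hypothetical, and the sketch assumes that plain paragraph text from the main document part is enough: headers, footers, footnotes, and nested paragraphs (such as text boxes) are not handled specially.

```php
<?php
// WordprocessingML main namespace used by word/document.xml.
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: turn the XML of word/document.xml into plain text,
// one line per w:p paragraph.
function documentXmlToText(string $xml): string
{
    $dom = new DOMDocument();
    // loadXML() rejects an empty string in PHP 8, so guard against it.
    if ($xml === '' || !$dom->loadXML($xml, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return '';
    }

    $paragraphs = [];
    foreach ($dom->getElementsByTagNameNS(W_NS, 'p') as $p) {
        // Join the text of every w:t element inside this paragraph.
        $text = '';
        foreach ($p->getElementsByTagNameNS(W_NS, 't') as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return implode("\n", $paragraphs);
}

// Hypothetical helper: open the .docx container and extract its text.
// Returns an empty string if the archive or the document part is missing.
function docxToText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return '';
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    return $xml === false ? '' : documentXmlToText($xml);
}
```

Keeping the XML parsing separate from the archive handling makes the text extraction usable on XML obtained some other way; calling `docxToText('test.docx')` would cover the common case, with the file name as a placeholder.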
A Word document isn’t stored conveniently like a text file (it’s closer to an XML or binary file), so you can’t just use echo and expect it to output the human-readable portion of the docx file.
There’s a library that could do what you want, but it only accepts .doc files.