Rtf to txt python

How to read an .rtf file into Python 3 strings that can be stored in a list?

I have an .rtf file and I want to read it and store its strings in a list using Python 3. Any package is fine, but it must work on both Windows and Linux. I have tried striprtf, but read_rtf is not working.

    from striprtf.striprtf import rtf_to_text
    from striprtf.striprtf import read_rtf

    rtf = read_rtf("file.rtf")
    text = rtf_to_text(rtf)
    print(text)

But this code fails with the error: cannot import name 'read_rtf'. Can anyone suggest a way to get strings from an .rtf file in Python 3?

4 Answers

    with open('yourfile.rtf', 'r') as file:
        text = file.read()
    print(text)

For a super large file, try this:

    with open("yourfile.rtf") as infile:
        for line in infile:
            do_something_with(line)
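Note that both snippets above return the file's raw RTF markup, not plain text. A minimal sketch of this, assuming a made-up one-line RTF document written to a temporary file (the sample content and the variable names are illustrative only):

```python
import os
import tempfile

# A tiny, hand-written RTF document (hypothetical sample, not from the question).
sample_rtf = r"{\rtf1\ansi Hello, {\b world}!\par}"

# Write it to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".rtf", delete=False) as tmp:
    tmp.write(sample_rtf)
    path = tmp.name

# Plain open()/read() returns the markup exactly as stored on disk.
with open(path) as infile:
    raw = infile.read()
os.remove(path)

print(raw)  # control words such as \rtf1 and \b are still present
```

So reading the file is only the first step; the markup still has to be stripped by an RTF-aware converter.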

Using rtf_to_text is enough to convert RTF into a string in Python. Read the content from an RTF file and then feed it to rtf_to_text:

    from striprtf.striprtf import rtf_to_text

    with open("yourfile.rtf") as infile:
        content = infile.read()
        text = rtf_to_text(content)
    print(text)

    from striprtf.striprtf import rtf_to_text

    sample_text = "any text as a string you want"
    text = rtf_to_text(sample_text)
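To end up with the list the question asks for, the converted text can be split into lines. A minimal sketch, assuming striprtf is installed and that the file can be read as UTF-8 with undecodable bytes ignored; the helper names text_to_lines and rtf_file_to_lines are hypothetical, not part of striprtf:

```python
def text_to_lines(text):
    """Return the non-empty lines of `text`, stripped of surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def rtf_file_to_lines(path):
    """Read an RTF file and return its plain-text lines as a list of strings."""
    # Imported here so text_to_lines stays usable without striprtf installed.
    from striprtf.striprtf import rtf_to_text  # pip install striprtf

    # Assumption: errors="ignore" drops any bytes that are not valid UTF-8.
    with open(path, encoding="utf-8", errors="ignore") as infile:
        return text_to_lines(rtf_to_text(infile.read()))


print(text_to_lines("first line\n\n  second line  \n"))
# → ['first line', 'second line']
```

Calling rtf_file_to_lines("yourfile.rtf") would then give a plain Python list, and nothing in the sketch is specific to Windows or Linux.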

Reading an RTF file and manipulating the data inside it is tricky and depends on the file you have. I tried all of the above and nothing worked; finally, the following code worked for me. It uses pywin32, so it needs Windows with Microsoft Word installed. I hope it helps those who are hunting for a solution.

    from win32com.client import Dispatch

    word = Dispatch('Word.Application')  # Open the Word application
    # word = DispatchEx('Word.Application')  # Start a separate process
    word.Visible = 0  # Run in the background, no display
    word.DisplayAlerts = 0  # No warnings
    path = r'C:\Projects\10.1\power.rtf'
    doc = word.Documents.Open(FileName=path, Encoding='gbk')
    for para in doc.paragraphs:
        print(para.Range.Text)
    doc.Close()
    word.Quit()

If you want to store the text in a single variable, use the following code.

    from win32com.client import Dispatch

    word = Dispatch('Word.Application')  # Open the Word application
    # word = DispatchEx('Word.Application')  # Start a separate process
    word.Visible = 0  # Run in the background, no display
    word.DisplayAlerts = 0  # No warnings
    path = r'C:\Projects\10.1\output_5.rtf'  # Use an absolute path; a relative path will fail
    doc = word.Documents.Open(FileName=path, Encoding='gbk')
    # for para in doc.paragraphs:
    #     print(para.Range.Text)
    content = '\n'.join([para.Range.Text for para in doc.paragraphs])
    print(content)
    doc.Close()
    word.Quit()
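The strings returned by para.Range.Text usually still carry Word's paragraph mark, so they may need cleaning before being stored in a list. A small sketch, assuming each paragraph ends with a carriage return and table cells end with "\r\x07" (this is my understanding of Word's behaviour, not something stated in the answer); clean_paragraphs is a hypothetical helper and the sample data is simulated so the snippet runs without Word:

```python
def clean_paragraphs(paragraph_texts):
    """Strip trailing paragraph/cell markers and drop paragraphs that end up empty."""
    cleaned = []
    for text in paragraph_texts:
        # Assumption: Word terminates paragraphs with "\r" and table cells with "\r\x07".
        text = text.rstrip("\r\n\x07")
        if text:
            cleaned.append(text)
    return cleaned


# Simulated values of para.Range.Text (hypothetical sample data).
print(clean_paragraphs(["Title\r", "\r", "cell\r\x07"]))
# → ['Title', 'cell']
```

With the code above, this could be used as clean_paragraphs(para.Range.Text for para in doc.paragraphs).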


How to convert RTF string to Plain text in python using any library [duplicate]

One more thing: I don't want to save this to an .rtf file and then convert it to plain text, because I have more data in the DB as strings.

You're probably not looking for this anymore, but here's a library that gets the job done: pypi.org/project/striprtf

1 Answer

    >>> text = "Whatever your rtf text goes here"
    >>> striprtf(text)

    # -*- coding: utf-8 -*-
    """
    Extract text in RTF Files. Refactored to use with Python 3.x
    Source: http://stackoverflow.com/a/188877
    Code created by Markus Jarderot: http://mizardx.blogspot.com
    """
    import re


    def striprtf(text):
        pattern = re.compile(
            r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]+|(.)",
            re.I)
        # Control words which specify a "destination".
        destinations = frozenset((
            'aftncn','aftnsep','aftnsepc','annotation','atnauthor','atndate','atnicn','atnid',
            'atnparent','atnref','atntime','atrfend','atrfstart','author','background',
            'bkmkend','bkmkstart','blipuid','buptim','category','colorschememapping',
            'colortbl','comment','company','creatim','datafield','datastore','defchp','defpap',
            'do','doccomm','docvar','dptxbxtext','ebcend','ebcstart','factoidname','falt',
            'fchars','ffdeftext','ffentrymcr','ffexitmcr','ffformat','ffhelptext','ffl',
            'ffname','ffstattext','field','file','filetbl','fldinst','fldrslt','fldtype',
            'fname','fontemb','fontfile','fonttbl','footer','footerf','footerl','footerr',
            'footnote','formfield','ftncn','ftnsep','ftnsepc','g','generator','gridtbl',
            'header','headerf','headerl','headerr','hl','hlfr','hlinkbase','hlloc','hlsrc',
            'hsv','htmltag','info','keycode','keywords','latentstyles','lchars','levelnumbers',
            'leveltext','lfolevel','linkval','list','listlevel','listname','listoverride',
            'listoverridetable','listpicture','liststylename','listtable','listtext',
            'lsdlockedexcept','macc','maccPr','mailmerge','maln','malnScr','manager','margPr',
            'mbar','mbarPr','mbaseJc','mbegChr','mborderBox','mborderBoxPr','mbox','mboxPr',
            'mchr','mcount','mctrlPr','md','mdeg','mdegHide','mden','mdiff','mdPr','me',
            'mendChr','meqArr','meqArrPr','mf','mfName','mfPr','mfunc','mfuncPr','mgroupChr',
            'mgroupChrPr','mgrow','mhideBot','mhideLeft','mhideRight','mhideTop','mhtmltag',
            'mlim','mlimloc','mlimlow','mlimlowPr','mlimupp','mlimuppPr','mm','mmaddfieldname',
            'mmath','mmathPict','mmathPr','mmaxdist','mmc','mmcJc','mmconnectstr',
            'mmconnectstrdata','mmcPr','mmcs','mmdatasource','mmheadersource','mmmailsubject',
            'mmodso','mmodsofilter','mmodsofldmpdata','mmodsomappedname','mmodsoname',
            'mmodsorecipdata','mmodsosort','mmodsosrc','mmodsotable','mmodsoudl',
            'mmodsoudldata','mmodsouniquetag','mmPr','mmquery','mmr','mnary','mnaryPr',
            'mnoBreak','mnum','mobjDist','moMath','moMathPara','moMathParaPr','mopEmu',
            'mphant','mphantPr','mplcHide','mpos','mr','mrad','mradPr','mrPr','msepChr',
            'mshow','mshp','msPre','msPrePr','msSub','msSubPr','msSubSup','msSubSupPr','msSup',
            'msSupPr','mstrikeBLTR','mstrikeH','mstrikeTLBR','mstrikeV','msub','msubHide',
            'msup','msupHide','mtransp','mtype','mvertJc','mvfmf','mvfml','mvtof','mvtol',
            'mzeroAsc','mzeroDesc','mzeroWid','nesttableprops','nextfile','nonesttables',
            'objalias','objclass','objdata','object','objname','objsect','objtime','oldcprops',
            'oldpprops','oldsprops','oldtprops','oleclsid','operator','panose','password',
            'passwordhash','pgp','pgptbl','picprop','pict','pn','pnseclvl','pntext','pntxta',
            'pntxtb','printim','private','propname','protend','protstart','protusertbl','pxe',
            'result','revtbl','revtim','rsidtbl','rxe','shp','shpgrp','shpinst',
            'shppict','shprslt','shptxt','sn','sp','staticval','stylesheet','subject','sv',
            'svb','tc','template','themedata','title','txe','ud','upr','userprops',
            'wgrffmtfilter','windowcaption','writereservation','writereservhash','xe','xform',
            'xmlattrname','xmlattrvalue','xmlclose','xmlname','xmlnstbl',
            'xmlopen',
        ))
        # Translation of some special characters.
        specialchars = {
            'par': '\n',
            'sect': '\n\n',
            'page': '\n\n',
            'line': '\n',
            'tab': '\t',
            'emdash': '\u2014',
            'endash': '\u2013',
            'emspace': '\u2003',
            'enspace': '\u2002',
            'qmspace': '\u2005',
            'bullet': '\u2022',
            'lquote': '\u2018',
            'rquote': '\u2019',
            'ldblquote': '\u201C',
            'rdblquote': '\u201D',
        }
        stack = []
        ignorable = False  # Whether this group (and all inside it) are "ignorable".
        ucskip = 1         # Number of ASCII characters to skip after a unicode character.
        curskip = 0        # Number of ASCII characters left to skip.
        out = []           # Output buffer.
        # text must be a str; decode bytes before calling this function.
        for match in pattern.finditer(text):
            word, arg, hex, char, brace, tchar = match.groups()
            if brace:
                curskip = 0
                if brace == '{':
                    # Push state
                    stack.append((ucskip, ignorable))
                elif brace == '}':
                    # Pop state
                    ucskip, ignorable = stack.pop()
            elif char:  # \x (not a letter)
                curskip = 0
                if char == '~':
                    if not ignorable:
                        out.append('\xA0')
                elif char in '{}\\':
                    if not ignorable:
                        out.append(char)
                elif char == '*':
                    ignorable = True
            elif word:  # \foo
                curskip = 0
                if word in destinations:
                    ignorable = True
                elif ignorable:
                    pass
                elif word in specialchars:
                    out.append(specialchars[word])
                elif word == 'uc':
                    ucskip = int(arg)
                elif word == 'u':
                    c = int(arg)
                    if c < 0:
                        c += 0x10000
                    out.append(chr(c))
                    curskip = ucskip
            elif hex:  # \'xx
                if curskip > 0:
                    curskip -= 1
                elif not ignorable:
                    c = int(hex, 16)
                    out.append(chr(c))
            elif tchar:
                if curskip > 0:
                    curskip -= 1
                elif not ignorable:
                    out.append(tchar)
        return ''.join(out)
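The heart of the function above is its tokenizing regular expression: every match is a control word, a hex escape, a control symbol, a brace, or a single text character. The sketch below shows that step in isolation, assuming the pattern as I recall it from the original Stack Overflow answer; the tokenize helper and the sample string are hypothetical and only illustrate the idea:

```python
import re

# Tokenizer pattern, split over several lines for readability.
pattern = re.compile(
    r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?"  # control word with optional numeric argument
    r"|\\'([0-9a-f]{2})"                 # \'xx hex-escaped byte
    r"|\\([^a-z])"                       # control symbol such as \~ or \*
    r"|([{}])"                           # group open/close
    r"|[\r\n]+"                          # line breaks (carry no meaning in RTF)
    r"|(.)",                             # any other single character
    re.I,
)


def tokenize(rtf):
    """Return (kind, value, argument) tuples for each token in an RTF string."""
    tokens = []
    for match in pattern.finditer(rtf):
        word, arg, hexbyte, symbol, brace, char = match.groups()
        if word:
            tokens.append(("word", word, arg))
        elif hexbyte:
            tokens.append(("hex", hexbyte, None))
        elif symbol:
            tokens.append(("symbol", symbol, None))
        elif brace:
            tokens.append(("brace", brace, None))
        elif char:
            tokens.append(("char", char, None))
        # Matches of [\r\n]+ set no group and are skipped.
    return tokens


for token in tokenize(r"{\fs24 Hi\par}"):
    print(token)
```

The full function then only has to decide, token by token, whether to emit text, translate a special control word, or ignore a whole group.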


Converting .rtf to .txt with Python 3

I have several hundred .rtf files that need to be converted to .txt. I have tried reading and writing the contents of the files into a new text file, but this seems rather tedious. Is there an easier way to do this with python 3? The data in the .rtf files is formatted as a table, and I need to convert it into one long list in the .txt file.

Are you only looking to change the file extension? If so, doing it in bash/cmd is probably easiest. It can be done in Python as well, of course; there are plenty of examples around showing how to list and loop over files in a directory, and how to rename files with Python.

2 Answers

I found this package: striprtf, it helped me. Sample usage from the docs:

    from striprtf.striprtf import rtf_to_text

    rtf = "some rtf encoded string"
    text = rtf_to_text(rtf)
    print(text)

    import os

    def convert_rtf_to_txt(directory):
        files = os.listdir(directory)
        for file in files:
            if os.path.isfile(os.path.join(directory, file)):
                filename, extension = os.path.splitext(file)
                if extension.lower() == ".rtf":
                    with open(os.path.join(directory, file), "r") as rtf_file:
                        rtf_content = rtf_file.read()
                    new_name = f"{filename}.txt"
                    with open(os.path.join(directory, new_name), "w") as txt_file:
                        txt_file.write(rtf_content)
                    os.remove(os.path.join(directory, file))
        print("RTF to TXT conversion complete.")

    directory_path = "D:\\rtf files"
    convert_rtf_to_txt(directory_path)

This code converts all RTF files in the specified directory to TXT format by reading the content of each RTF file, creating a corresponding TXT file with the same content, and finally removing the original RTF files.

Your code looks like it works great for changing the extensions of a set of files! Unfortunately, it doesn’t address the question since it doesn’t convert the content of the files. RTF has a specific format that the question asker is looking to convert to plain text. See the Wikipedia page en.wikipedia.org/wiki/Rich_Text_Format
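Combining the two answers addresses that objection: loop over the directory as above, but pass each file's content through an RTF converter before writing it out. A minimal sketch, assuming striprtf is installed for the default converter and that undecodable bytes can be ignored; convert_directory and its convert parameter are hypothetical names, and unlike the answer above the original .rtf files are kept:

```python
from pathlib import Path


def convert_directory(directory, convert=None):
    """Write a .txt file next to every .rtf file in `directory`; return the new paths."""
    if convert is None:
        # Default converter; requires the third-party package (pip install striprtf).
        from striprtf.striprtf import rtf_to_text as convert

    written = []
    for rtf_path in sorted(Path(directory).iterdir()):
        if rtf_path.is_file() and rtf_path.suffix.lower() == ".rtf":
            # Assumption: undecodable bytes can be ignored for these files.
            rtf_content = rtf_path.read_text(encoding="utf-8", errors="ignore")
            txt_path = rtf_path.with_suffix(".txt")
            # Write the converted plain text, not the raw RTF markup.
            txt_path.write_text(convert(rtf_content), encoding="utf-8")
            written.append(txt_path)
    return written
```

Calling convert_directory("D:\\rtf files") would process the whole folder; the convert parameter exists mainly so a different converter can be swapped in.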


Is there a Python module for converting RTF to plain text? [closed]


Ideally, I’d like a module or library that doesn’t require superuser access to install; I have limited privileges in my working environment.

10 Answers

I’ve been working on a library called Pyth, which can do this:

Converting an RTF file to plaintext looks something like this:

    from pyth.plugins.rtf15.reader import Rtf15Reader
    from pyth.plugins.plaintext.writer import PlaintextWriter

    doc = Rtf15Reader.read(open('sample.rtf'))
    print(PlaintextWriter.write(doc).getvalue())

Pyth can also generate RTF files, read and write XHTML, generate documents from Python markup a la Nevow's stan, and has limited experimental support for LaTeX and PDF output. Its RTF support is pretty robust; we use it in production to read RTF files generated by various versions of Word, OpenOffice, Mac TextEdit, EIOffice, and others.

@Epoc, there is some work towards making it compatible with Python 3. I have one fork in my repo that you can install with pip install git+https://github.com/robertour/pyth@pyth-py3 . You can see some of the discussion here.

OpenOffice has an RTF reader. You can use Python to script OpenOffice, see here for more info.
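One way to drive an office suite from Python is to call its command-line converter through subprocess. The sketch below assumes a LibreOffice-style soffice binary on the PATH that accepts --headless, --convert-to and --outdir; whether a given OpenOffice build supports these options is an assumption to verify, and the helper names are hypothetical:

```python
import subprocess


def build_convert_command(rtf_path, out_dir, soffice="soffice"):
    """Build the command line for a headless RTF-to-text conversion."""
    return [
        soffice,
        "--headless",                 # no GUI
        "--convert-to", "txt:Text",   # plain-text export filter
        "--outdir", str(out_dir),
        str(rtf_path),
    ]


def convert_with_office(rtf_path, out_dir):
    """Run the conversion; raises CalledProcessError if soffice exits with an error."""
    subprocess.run(build_convert_command(rtf_path, out_dir), check=True)


print(build_convert_command("sample.rtf", "out"))
```

This needs no superuser access at run time, but it does require the office suite to be installed already.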

You could probably try using the magic com-object on Windows to read anything that smells ms-binary. I wouldn’t recommend that though.

Actually parsing the raw data probably won’t be very hard, see this example written in .bat/QBasic.

DocFrac is a free open source converter between RTF, HTML and text. Windows, Linux, ActiveX and DLL platforms are available. It will probably be pretty easy to wrap it up in Python.

RTF::TEXT::Converter: a Perl extension for converting RTF into text (in case you have problems with DocFrac).

Official Rich Text Format (RTF) Specifications, version 1.7, by Microsoft.

Good luck (with the limited privileges in your working environment).
