Cleaning up RTF text

Posted by 只愿长相守 on 2019-11-30 14:56:53

Question


I'd like to take some RTF input and strip all RTF formatting except \ul, \b and \i, so that I can paste it into Word with only minimal formatting.

The command used to paste into Word will be something like: oWord.ActiveDocument.ActiveWindow.Selection.PasteAndFormat(0) (with some RTF text already in the Clipboard)

{\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 Courier New;}}
{\colortbl ;\red255\green255\blue140;}
\viewkind4\uc1\pard\highlight1\lang3084\f0\fs18 The company is a global leader in responsible tourism and was \ul the first major hotel chain in North America\ulnone  to embrace environmental stewardship within its daily operations\highlight0\par

Do you have any idea on how I can clean up the RTF safely with some regular expressions or something? I am using VB.NET to do the processing but any .NET language sample will do.


Answer 1:


I would use a hidden RichTextBox: set its Rtf member, then read back its Text member, which strips the RTF in a well-supported way. Then I would manually re-inject the desired formatting afterwards.




Answer 2:


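I'd do something like the following: swap the control words you want to keep for placeholders, let a RichTextBox strip everything else, then turn the placeholders back into control words.

Imports System.Text.RegularExpressions

' Replace \ul, \ulnone, \b, \b0, \i and \i0 with placeholders such as [[ul]].
' The lookahead keeps \b from matching \blue, \i from matching \intbl, and so on.
Dim marked As String = Regex.Replace(someRTFtext, "(?<!\\)\\(ulnone|ul|b|i)(0?)(?![A-Za-z0-9]) ?", "[[$1$2]]")

' Let RichTextBox parse the RTF and return plain text; the placeholders survive as text.
Dim unformattedText As String
Using rtfConvert As New RichTextBox()
    rtfConvert.Rtf = marked
    unformattedText = rtfConvert.Text
End Using

' Escape RTF special characters and line breaks before restoring the control words.
' Note: non-ASCII characters would additionally need \uN? escapes.
unformattedText = unformattedText.Replace("\", "\\").Replace("{", "\{").Replace("}", "\}")
unformattedText = unformattedText.Replace(vbLf, "\par ")
unformattedText = Regex.Replace(unformattedText, "\[\[(ulnone|ul|b|i)(0?)\]\]", "\$1$2 ")

' Wrap the result in a minimal RTF document and put it on the clipboard as RTF, not as plain text.
Clipboard.SetText("{\rtf1\ansi " & unformattedText & "}", TextDataFormat.Rtf)

oWord.ActiveDocument.ActiveWindow.Selection.PasteAndFormat(0)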



Answer 3:


You can strip out the tags with regular expressions. Just make sure that your expressions do not filter out tags that were actually text. If the body text contained a literal "\b", it would appear as "\\b" in the RTF stream, because RTF escapes a backslash by doubling it. In other words, you would match on "\b" but not on "\\b".
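
A minimal sketch of that distinction in C#, assuming the input is an RTF body with no nested groups. The class and method names (`RtfEscapes`, `RemoveControlWords`) are made up for illustration. The regex tries the escaped literal first, so a doubled backslash is consumed as one unit and the letter after it is never mistaken for a control word.

```csharp
using System.Text.RegularExpressions;

public static class RtfEscapes
{
    // First alternative: an escaped literal (\\, \{ or \}).
    // Second alternative: a control word such as \b or \fs18, with its
    // optional numeric argument and the single delimiter space after it.
    private static readonly Regex Token =
        new Regex(@"(?<literal>\\[\\{}])|(?<control>\\[A-Za-z]+-?\d* ?)");

    // Removes control words but leaves escaped literals untouched.
    public static string RemoveControlWords(string rtfBody)
    {
        return Token.Replace(rtfBody,
            m => m.Groups["literal"].Success ? m.Value : string.Empty);
    }
}
```

For example, `RtfEscapes.RemoveControlWords(@"\b bold\b0  and a literal \\b here")` drops `\b` and `\b0` but keeps the escaped `\\b`.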

You could probably take a shortcut and filter out the RTF header tags. Look for the first occurrence of "\viewkind4" in the input, then read ahead to the first space character. Remove every character from the start of the text up to and including that space. That strips out the RTF header information (fonts, colors, etc.).
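
The shortcut could be sketched in C# as below. `RtfHeader` and `StripHeader` are hypothetical names, and the approach assumes the input contains "\viewkind4" (as the sample in the question does); RTF from other producers may not contain that marker, in which case the input is returned unchanged.

```csharp
using System;

public static class RtfHeader
{
    // Drops everything up to and including the first space after "\viewkind4".
    // Returns the input unchanged when the marker or the space is not found.
    public static string StripHeader(string rtf)
    {
        int marker = rtf.IndexOf(@"\viewkind4", StringComparison.Ordinal);
        if (marker < 0) return rtf;

        int space = rtf.IndexOf(' ', marker);
        return space < 0 ? rtf : rtf.Substring(space + 1);
    }
}
```

What is left still contains the inline control words (\ul, \par and so on) and the closing brace of the document, so it needs a second pass before it can be used.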




Answer 4:


Use a regex. It won't parse absolutely everything correctly (tables, for example), but it does the job in most cases.

string unformatted = Regex.Replace(rtfString, @"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?", "");

Magic =)
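
Since the question asks to keep \ul, \b and \i, here is a hedged variation on the same idea rather than a definitive implementation. It is modelled on the one-line regex above, with a named group so the control word can be checked against a keep-list. `RtfCleaner`, `KeepBasicFormatting` and the contents of the keep-list are assumptions made for illustration, and it shares the regex's limitations: it is not a real RTF parser, and it drops any innermost group that begins with a control word.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RtfCleaner
{
    // Control words that survive the clean-up; extend the set (e.g. with "par") as needed.
    private static readonly HashSet<string> Keep =
        new HashSet<string> { "ul", "ulnone", "b", "i" };

    // Alternatives, in order: escaped literal, innermost group that starts with a
    // control word, stray brace, control word with optional argument and delimiter.
    private static readonly Regex Token = new Regex(
        @"(?<literal>\\[\\{}])|\{\*?\\[^{}]+\}|[{}]|\\(?<word>[A-Za-z]+)(?<arg>-?\d+)? ?");

    public static string KeepBasicFormatting(string rtf)
    {
        string body = Token.Replace(rtf, m =>
        {
            if (m.Groups["literal"].Success)
                return m.Value;                       // \\, \{ and \} stay as they are
            if (m.Groups["word"].Success && Keep.Contains(m.Groups["word"].Value))
                // Re-emit with a delimiter space so the control word cannot
                // merge with the text that follows it.
                return @"\" + m.Groups["word"].Value + m.Groups["arg"].Value + " ";
            return string.Empty;                      // everything else is dropped
        });

        // Re-wrap the body so the result is a valid RTF document again.
        return @"{\rtf1\ansi " + body.Trim() + "}";
    }
}
```

The result can be placed on the clipboard as RTF before calling PasteAndFormat. Paragraph breaks are lost unless "par" is added to the keep-list.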



Source: https://stackoverflow.com/questions/20450/cleaning-up-rtf-text
