What's the best way to remove HTML from a string?

Posted by 会有一股神秘感。 on 2020-01-24 09:52:26

Question


I recently started using the following RegEx in a ReReplace() function to strip HTML tags from a string using ColdFusion. Please note: I am not using this as protection from XSS or SQL injection; this is only to remove existing and safe HTML from a string before it's displayed in an HTML title attribute.

REReplaceNoCase(str,"<[^>]*>","","ALL")

In a semi-related question I asked how to modify my RegEx to include spaces and line breaks. I was told that using RegEx for this purpose is not appropriate and this post was referenced as an explanation.

I strongly suspect though that the regular expressions you have posted don't in fact work correctly. I'd advise you not to use regular expressions to parse HTML as HTML is not a regular language. Use an HTML parser instead. (Mark Byers)

If this is true, what is the appropriate tool for removing HTML from a string before it's displayed? (Bearing in mind the HTML is already safe; it's sanitized before entry to the DB.)

I am aware of HTMLEditFormat() and HTMLCodeFormat(), but those two functions do not provide what I need; the former replaces special characters with their HTML-escaped equivalents, while the latter does exactly the same but also wraps the string in a <pre> tag.

What I would like to do is strip HTML and line breaks from a string before I display it in an HTML title attribute: <a title="My string without HTML goes here">...</a>

There are times when the HTML is not necessary. Say you wanted to display an excerpt from a post without the HTML stored along with it, for instance.


Answer 1:


I disagree with the reasoning you quote. While HTML should not be parsed with regexen, stripping tags is perfect for them.

But you will want to be more careful than just <[^>]*>, since that would turn

<span title=">">...</span>

into the mangled

">...

So you need something like <([^"'>]|"[^"]*"|'[^']*')*> instead, which treats quoted attribute values as units so that a > inside them does not end the match. You can strip out line breaks with plain character replacement instead of a regex, but if you prefer a regex you can use something like \n (or even combine it with the pattern above using alternation, though that is even less efficient).
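To make the difference concrete, here is a minimal sketch in Java (chosen because ColdFusion runs on the JVM; the same pattern could be tried in REReplaceNoCase(), though that is not tested here). The class name TagStripper and the method stripTags are hypothetical names used only for illustration. The pattern is a variant of the quote-aware one discussed in this answer that excludes both quote characters from the unquoted alternative, so single-quoted values containing > are handled too. It assumes short, already-sanitized strings such as title text, not whole documents.

```java
import java.util.regex.Pattern;

final class TagStripper {
    // Quote-aware tag pattern: inside a tag, consume unquoted characters,
    // double-quoted strings, or single-quoted strings, so that a '>' inside
    // an attribute value does not end the match early.
    private static final Pattern TAG =
        Pattern.compile("<(?:[^\"'>]|\"[^\"]*\"|'[^']*')*>");

    // Any run of whitespace, including \r and \n.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Removes tags, then collapses line breaks and other whitespace runs
    // into single spaces.
    static String stripTags(String html) {
        String text = TAG.matcher(html).replaceAll("");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String input = "<span title=\">\">Hello</span>\n<b>world</b>";
        System.out.println(stripTags(input)); // prints: Hello world
    }
}
```

Collapsing whitespace in a second pass keeps the two concerns separate, in line with the remark above that line breaks do not need to be folded into the tag pattern.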




Answer 2:


Use the Chilkat HTML parser. We used it in an academic project of mine to fetch all the content and hyperlinks from HTML pages in order to build a basic search engine.




Answer 3:


If the HTML snippet is to be included in a title, you can probably cover all bases with regexes and enough testing.

Still, as a general hint, if you have to handle a larger snippet, I'd go the XML/DOM way with Java, either by parsing with dom4j and grabbing the text or, more likely, by building the result in a StringBuilder with a SAX parser.

[EDIT] When I first answered, I was about to write that the HTML must be reasonably well-formed, but I assumed you had at least a bit of control over the source. If you don't, I'll just link quickly to JTidy and TagSoup. I have not tested either, but they are definitely the first things I would try for consuming real-world HTML with CF.
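As a rough illustration of the SAX idea, the sketch below uses only the JDK's built-in XML parser. The class name SaxTextExtractor, the method extractText, and the synthetic <root> wrapper element are hypothetical choices made for this example. It assumes the snippet is well-formed XML: unclosed tags such as <br> or HTML-only entities such as &nbsp; will make the parser reject the input, which is exactly the case where a cleanup step like JTidy or TagSoup would be needed first.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxTextExtractor {
    // Collects only character data from a well-formed markup fragment.
    static String extractText(String fragment) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // tags never reach this callback
            }
        };
        // Wrap in a synthetic root so fragments with several top-level
        // elements still form a single XML document.
        String wrapped = "<root>" + fragment + "</root>";
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(wrapped)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
        // Collapse line breaks and other whitespace runs into single spaces.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>Hello <b>world</b></p>\n<p title=\">\">again</p>"));
        // prints: Hello world again
    }
}
```

Because the parser tokenizes attributes itself, a > inside an attribute value needs no special handling here; the trade-off is the strict well-formedness requirement.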



Source: https://stackoverflow.com/questions/4550583/whats-the-best-way-to-remove-html-from-a-string
