C++: Remove all HTML formatting from string?

Submitted by 匆匆过客 on 2019-12-12 18:23:19

Question


I have a string which might include <br> or <span>...</span> tags or other HTML characters/entities. I want a robust way of stripping all that and getting the remaining UTF-8 characters. This should be cross-platform, ideally.

Something like this would be ideal:

http://snipplr.com/view/15261/python-decode-and-strip-html-entites-to-unicode/

but that also removes the tags.


Answer 1:


Just how stringent are your requirements? A simple two-state FSA ought to do. Start in the READCHAR state. Whenever you read a '<' in that state, transition to the READTAG state; otherwise, write the character to your result string. Whenever you're in the READTAG state and read a '>', transition back to the READCHAR state.
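The two-state machine described above could be sketched roughly as follows. The function name strip_tags and the state names are illustrative only, and the sketch assumes that '>' never appears inside an attribute value, comment, or script block, which real-world HTML does not guarantee.

```cpp
#include <string>

// Hypothetical helper: copy characters until '<', then skip until '>'.
// UTF-8 multi-byte sequences pass through untouched because none of their
// bytes can equal '<' or '>'.
std::string strip_tags(const std::string& input) {
    enum class State { ReadChar, ReadTag };
    State state = State::ReadChar;
    std::string result;
    result.reserve(input.size());

    for (char c : input) {
        if (state == State::ReadChar) {
            if (c == '<') {
                state = State::ReadTag;   // start of a tag: stop copying
            } else {
                result.push_back(c);      // ordinary text: keep it
            }
        } else if (c == '>') {
            state = State::ReadChar;      // end of the tag: resume copying
        }
    }
    return result;
}
```

Entities such as &amp; are left as-is by this version.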

Edit: Oops. Missed the part about entities. You'll need a READENTITY state for that too. When you transition out of it, you could also convert the code into the corresponding UTF-8 character.
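A possible three-state version, with a READENTITY state that decodes to UTF-8, is sketched below. All function names (append_utf8, decode_entity, strip_html) are made up for illustration, and the named-entity table is a deliberately tiny subset; a complete solution would need the full HTML entity list. Unrecognized or malformed entities are copied through unchanged, which is one design choice among several.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Append the code point cp to out, encoded as UTF-8 (1 to 4 bytes).
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Decode the text between '&' and ';'. Returns false if it is not recognized.
bool decode_entity(const std::string& name, std::string& out) {
    // Assumption: only a handful of named entities are supported here.
    static const std::unordered_map<std::string, std::uint32_t> named = {
        {"amp", '&'}, {"lt", '<'}, {"gt", '>'},
        {"quot", '"'}, {"apos", '\''}, {"nbsp", 0xA0}};
    if (!name.empty() && name[0] == '#') {            // numeric: &#233; or &#xE9;
        const bool hex = name.size() > 1 && (name[1] == 'x' || name[1] == 'X');
        const std::string digits = name.substr(hex ? 2 : 1);
        if (digits.empty()) return false;
        std::uint32_t cp = 0;
        for (char d : digits) {
            std::uint32_t v;
            if (d >= '0' && d <= '9') v = d - '0';
            else if (hex && d >= 'a' && d <= 'f') v = d - 'a' + 10;
            else if (hex && d >= 'A' && d <= 'F') v = d - 'A' + 10;
            else return false;
            cp = cp * (hex ? 16u : 10u) + v;
            if (cp > 0x10FFFF) return false;          // beyond Unicode range
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogates are invalid
        append_utf8(out, cp);
        return true;
    }
    const auto it = named.find(name);
    if (it == named.end()) return false;
    append_utf8(out, it->second);
    return true;
}

std::string strip_html(const std::string& input) {
    enum class State { ReadChar, ReadTag, ReadEntity };
    State state = State::ReadChar;
    std::string result, entity;
    for (std::size_t i = 0; i < input.size(); ++i) {
        const char c = input[i];
        switch (state) {
        case State::ReadChar:
            if (c == '<') state = State::ReadTag;
            else if (c == '&') { entity.clear(); state = State::ReadEntity; }
            else result.push_back(c);
            break;
        case State::ReadTag:
            if (c == '>') state = State::ReadChar;
            break;
        case State::ReadEntity:
            if (c == ';') {
                if (!decode_entity(entity, result)) result += '&' + entity + ';';
                state = State::ReadChar;
            } else if (c == '#' || std::isalnum(static_cast<unsigned char>(c))) {
                entity.push_back(c);
            } else {
                result += '&' + entity;   // a bare '&', not an entity: keep it
                state = State::ReadChar;
                --i;                      // reprocess c in the ReadChar state
            }
            break;
        }
    }
    if (state == State::ReadEntity) result += '&' + entity;  // unterminated at end
    return result;
}
```

With this sketch, strip_html("AT&T <b>rocks</b>") leaves the bare ampersand alone and yields "AT&T rocks".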




Answer 2:


I'm not clear on what you want.

Input: This is a string <br> <br /> of text &amp; on many lines &quot;

Should this output:

1) This is a string <br> <br /> of text & on many lines "   (Replace &amp; with & and &quot; with ") 
2) This is a string of text & on many lines "



Answer 3:


Do you want to simply delete the elements, or to convert HTML to plain text?

Option 1:

If you just want to delete all occurrences of <br> and <span> you can use a regex search and replace.
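For Option 1, a minimal sketch using the standard <regex> library might look like the following. The function name remove_br_span and the exact pattern are assumptions chosen for illustration; the pattern only targets br and span tags (opening, closing, or self-closing) and, like any regex approach, can be fooled by a '>' inside an attribute value.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: delete <br>, <br />, <span ...> and </span> tags,
// leaving all other markup and text untouched.
std::string remove_br_span(const std::string& input) {
    // </?        optional slash for closing tags
    // (?:br|span) the two tag names, matched case-insensitively
    // \b         word boundary, so <brand> is not matched
    // [^>]*>     any attributes up to the closing '>'
    static const std::regex tag(R"(</?(?:br|span)\b[^>]*>)",
                                std::regex::ECMAScript | std::regex::icase);
    return std::regex_replace(input, tag, "");
}
```

Replacing the pattern with R"(<[^>]*>)" would remove every tag rather than just these two.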

Option 2:

If what you're really trying to do is take a page that has formatting and convert it to plain text, the simplest and most robust way I can think of is to use a browser, or some browser engine, to actually parse the HTML and extract the text from it.

In other words, this is equivalent to copying a web page from the browser into the clipboard and then pasting it into Notepad.



Source: https://stackoverflow.com/questions/979071/c-remove-all-html-formatting-from-string
