How to remove all tags and get the pure text?

Submitted by 隐身守侯 on 2019-12-11 03:32:56

Question


I have to store user input text in my database together with its HTML and CSS formatting.

The case is:

I use RadEditor. The user copies text from MS Word into this editor, and I store that text in the database with its formatting. When I retrieve the data for a report or a label, some tags appear wrapped around the text.

I use a regular expression to remove all the formatting, but it only succeeds some of the time.

private static Regex oClearHtmlScript = new Regex(@"<(.|\n)*?>", RegexOptions.Compiled);

        public static string RemoveAllHTMLTags(string sHtml)
        {

            sHtml = sHtml.Replace("&nbsp;", string.Empty);
            sHtml = sHtml.Replace("&gt;", ">");
            sHtml = sHtml.Replace("&lt;", "<");
            sHtml = sHtml.Replace("&amp;", "&");
            if (string.IsNullOrEmpty(sHtml))
                return string.Empty;

            return oClearHtmlScript.Replace(sHtml, string.Empty);
        }

How can I remove all the formatting using HtmlAgilityPack, or any other dependable way, to ensure the text is pure?

Note: The datatype of this field in the database is Lvarchar.


Answer 1:


This should strip out all HTML tags from a string, as long as no tag spans a line break (by default . does not match a newline).

sHtml = Regex.Replace(sHtml, "<.*?>", "");
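
Markup pasted from Word often contains tags that wrap across several lines, and the function in the question decodes &lt; and &gt; before stripping, which can turn literal text into something that looks like a tag. The following is a minimal sketch of a regex-only variant that addresses both points, not a full HTML parser. The names HtmlText, StripTags and TagPattern are made up for illustration; WebUtility.HtmlDecode comes from System.Net.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Singleline lets '.' match '\n', so tags that wrap across lines are removed too.
    private static readonly Regex TagPattern =
        new Regex("<.*?>", RegexOptions.Singleline | RegexOptions.Compiled);

    // Hypothetical helper: strips tags first, then decodes entities.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // Decoding after stripping keeps an escaped "&lt;b&gt;" as literal text
        // instead of letting it be mistaken for a tag.
        string withoutTags = TagPattern.Replace(html, string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }

    public static void Main()
    {
        string sample = "<p class=\"MsoNormal\"\nstyle=\"margin:0\">1 &lt; 2 &amp; 3 &gt; 2</p>";
        Console.WriteLine(StripTags(sample)); // prints: 1 < 2 & 3 > 2
    }
}
```

This sketch still leaves the contents of style blocks and Word's conditional comments in the output, and it breaks on a > inside an attribute value, which may be why a regex-only approach succeeds only some of the time. Note also that HtmlDecode turns &nbsp; into a non-breaking space (U+00A0) rather than removing it.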



Answer 2:


This post recommends the following approach (and seems to have been accepted).

// Turn paragraph and line-break tags into newlines, then strip the remaining tags.
// Regex.Replace returns a new string, so the result must be assigned.
myHTMLString = Regex.Replace(myHTMLString, @"<p>|</p>|<br>|<br />", "\r\n");
myHTMLString = Regex.Replace(myHTMLString, @"<.+?>", string.Empty);

Given that you're still having difficulty, could you try instantiating a RadEditor and using the .Text property? I've not used RadEditor before, but I did some digging. Could you try something like this:

RadEditor editor = new RadEditor();
editor.Content = myHTMLString;
string plainText = editor.Text;

This is probably a VERY expensive operation, but I'd be interested to know if it works!




Answer 3:


The HTML Agility Pack makes working with HTML easy.

HtmlDocument mainDoc = new HtmlDocument();
string htmlString = "<html><body><h1>Test</h1> more text</body></html>";
mainDoc.LoadHtml(htmlString);
string cleanText = mainDoc.DocumentNode.InnerText;



Answer 4:


See my answer here for how it can be done using the Agility Pack. You may have to change the code a little to not strip out words less than two characters though. Also, line breaks will be removed as well, so you'll be left with one long line of text.



Source: https://stackoverflow.com/questions/16303828/how-to-remove-all-tags-and-get-the-pure-text