How to remove all tags and get the pure text?

Question

I have to store user-entered text in my database together with its HTML and CSS formatting.

The case is:

I use a RadEditor. The user copies text from MS Word into this editor, and I store that text in the database with its formatting. When I later retrieve the data for a report or a label, some tags appear around the text.

I use a regular expression to remove all the formatting, but it is unreliable: it succeeds sometimes, not every time.

private static Regex oClearHtmlScript = new Regex(@"<(.|\n)*?>", RegexOptions.Compiled);

public static string RemoveAllHTMLTags(string sHtml)
{
    // Guard first: calling Replace on a null string would throw.
    if (string.IsNullOrEmpty(sHtml))
        return string.Empty;

    // Decode a handful of common entities.
    sHtml = sHtml.Replace("&nbsp;", string.Empty);
    sHtml = sHtml.Replace("&gt;", ">");
    sHtml = sHtml.Replace("&lt;", "<");
    sHtml = sHtml.Replace("&amp;", "&");

    // Strip anything that looks like a tag.
    return oClearHtmlScript.Replace(sHtml, string.Empty);
}

How can I remove all the formatting, using the HTML Agility Pack or any other dependable approach, so that only plain text remains?

Note: The data type of this field in the database is LVARCHAR.

Answer 1:

This should strip all HTML tags from a string.

sHtml = Regex.Replace(sHtml, "<.*?>", "");
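One caveat with this pattern: by default `.` does not match a newline, so a tag whose attributes wrap onto a second line is left in place. The following is a minimal sketch of one way around that, assuming `RegexOptions.Singleline` is acceptable and that entities should be decoded only after the tags are gone. The class and method names (`TagStripper`, `Strip`) are hypothetical, not from the original post.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper // hypothetical helper name
{
    // Singleline lets '.' match '\n', so tags that span several lines are matched too.
    private static readonly Regex Tag =
        new Regex("<.*?>", RegexOptions.Singleline | RegexOptions.Compiled);

    public static string Strip(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // Remove tags first, then decode entities, so that text such as
        // "&lt;b&gt;" survives as the literal characters "<b>".
        string withoutTags = Tag.Replace(html, string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

Note that this only removes the tags themselves: text inside `<style>` or `<script>` elements is kept, and `WebUtility.HtmlDecode` turns `&nbsp;` into a non-breaking space character rather than deleting it.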

Answer 2:

This post recommends the following approach (and it seems to have been accepted).

myHTMLString = Regex.Replace(myHTMLString, @"<p>|</p>|<br>|<br />", "\r\n");
myHTMLString = Regex.Replace(myHTMLString, @"<.+?>", string.Empty);
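`Regex.Replace` returns a new string rather than modifying its argument, so each result has to be assigned. As a rough sketch, the two steps could be wrapped in a helper like the one below. The helper name is hypothetical, and the first pattern is a slightly broadened variant (an assumption on my part) so that tags carrying attributes, such as `<p class="MsoNormal">`, or written in upper case also become line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlToText // hypothetical helper name
{
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // Step 1: paragraph and line-break tags (with optional attributes) become newlines.
        string text = Regex.Replace(
            html, @"</?p\b[^>]*>|<br\b[^>]*>", "\r\n", RegexOptions.IgnoreCase);

        // Step 2: every remaining tag is dropped; Singleline covers tags that wrap lines.
        text = Regex.Replace(text, @"<.+?>", string.Empty, RegexOptions.Singleline);

        // Trim the newlines produced by the outermost tags.
        return text.Trim();
    }
}
```

For example, `HtmlToText.ToPlainText("<p>One</p><p>Two<br />Three</p>")` yields the three words separated by line breaks instead of one run of text.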

Given that you're still having difficulty, could you try instantiating a RadEditor and using its .Text property? I've not used RadEditor before, but I did some digging. Could you try something like this:

RadEditor editor = new RadEditor();
editor.Content = myHTMLString;
string plainText = editor.Text;

This is probably a VERY expensive operation, but I'd be interested to know if it works!

Answer 3:

The HTML Agility Pack makes working with HTML easy.

HtmlDocument mainDoc = new HtmlDocument();
string htmlString = "<html><body><h1>Test</h1> more text</body></html>";
mainDoc.LoadHtml(htmlString);
string cleanText = mainDoc.DocumentNode.InnerText;
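Two things to watch for with `InnerText`: it includes the text of `<style>` and `<script>` elements, and it leaves entities such as `&amp;` encoded. A minimal sketch that handles both, assuming the HtmlAgilityPack package is referenced and that its XPath support accepts `comment()` (the helper name `AgilityText` is hypothetical):

```csharp
using System;
using HtmlAgilityPack;

public static class AgilityText // hypothetical helper name
{
    public static string Extract(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, hence the null check.
        var unwanted = doc.DocumentNode.SelectNodes("//script|//style|//comment()");
        if (unwanted != null)
        {
            foreach (var node in unwanted)
                node.Remove();
        }

        // InnerText leaves entities encoded; DeEntitize decodes them.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

This may matter for text pasted from Word, which often carries style blocks and comments along with the visible content.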

Answer 4:

See my answer here for how it can be done using the Agility Pack. You may have to change the code a little to not strip out words less than two characters though. Also, line breaks will be removed as well, so you'll be left with one long line of text.

Source: https://stackoverflow.com/questions/16303828/how-to-remove-all-tags-and-get-the-pure-text

Tags

ASP.NET

html

regex

html-agility-pack

informix