I have an asp.net web page that has a TinyMCE box. Users can format text and send the HTML to be stored in a database.
On the server, I would like to take strip the h
I downloaded the HtmlAgilityPack and created this function:
string StripHtml(string html)
{
// create whitespace between html elements, so that words do not run together
html = html.Replace(">","> ");
// parse html
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// strip html decoded text from html
string text = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
// replace all whitespace with a single space and remove leading and trailing whitespace
return Regex.Replace(text, @"\s+", " ").Trim();
}
You can use HTQL COM, and query the source with a query: <body> &tx;
Take a look at this Strip HTML tags from a string using regular expressions
You could:
As you may have malformed HTML in the system: BeautifulSoup or similar could be used.
It is written in Python; I am not sure how it could be interfaced - using the .NET language IronPython?
Here's Jeff Atwood's RefactorMe code link for his Sanitize HTML method