How can I strip HTML from text in .NET?

爱一瞬间的悲伤 2020-12-16 02:20

I have an ASP.NET web page with a TinyMCE box. Users can format text and submit the resulting HTML, which is stored in a database.

On the server, I would like to strip the h…

9 answers
  •  挽巷 (original poster)
    2020-12-16 02:57

    If you are only storing the text for indexing, you probably want to do more than just remove the HTML, such as ignoring stop-words and removing words shorter than (say) 3 characters. However, a simple tag stripper I once wrote goes something like this:

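        // Requires: using System.Text.RegularExpressions;
        public static string StripTags(string value)
        {
            if (value == null)
                return string.Empty;

            // Replace character entities such as &nbsp; or &#160; with a space.
            string pattern = @"&#?\w{1,8};";
            value = Regex.Replace(value, pattern, " ");

            // Remove anything that looks like a tag; [^>] also matches newlines.
            pattern = @"<[^>]*>";
            return Regex.Replace(value, pattern, string.Empty);
        }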
    

    It's old and I'm sure it can be optimised (perhaps using a compiled reg-ex?). But it does work and may help...
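
    As a rough sketch of the two ideas mentioned above (a compiled regex, plus dropping stop-words and words shorter than 3 characters), something like the following might work. The class name `IndexText`, the method `ToIndexTerms`, the `minLength` parameter, and the tiny stop-word list are all hypothetical and only for illustration. It also swaps the entity regex for `WebUtility.HtmlDecode`, which is an assumption about what is wanted, not part of the original approach.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Net;
    using System.Text.RegularExpressions;

    public static class IndexText
    {
        // Built once and reused; RegexOptions.Compiled trades startup cost for faster matching.
        private static readonly Regex TagRegex = new Regex(@"<[^>]*>", RegexOptions.Compiled);
        private static readonly Regex WordRegex = new Regex(@"[A-Za-z0-9]+", RegexOptions.Compiled);

        // Hypothetical stop-word list, for illustration only.
        private static readonly HashSet<string> StopWords =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "the", "and", "for", "with" };

        public static string StripTags(string value)
        {
            if (string.IsNullOrEmpty(value))
                return string.Empty;

            // Replace tags with a space so adjacent words do not fuse, then decode entities.
            string withoutTags = TagRegex.Replace(value, " ");
            return WebUtility.HtmlDecode(withoutTags);
        }

        public static List<string> ToIndexTerms(string html, int minLength = 3)
        {
            // Keep lower-cased ASCII words that are long enough and not stop-words.
            return WordRegex.Matches(StripTags(html))
                .Cast<Match>()
                .Select(m => m.Value.ToLowerInvariant())
                .Where(w => w.Length >= minLength && !StopWords.Contains(w))
                .ToList();
        }
    }
    ```

    For example, `IndexText.ToIndexTerms("<p>The <b>quick</b> brown&nbsp;fox is at it</p>")` yields `quick`, `brown`, `fox`. Decoding after the tags are removed means an escaped `&lt;b&gt;` typed by the user survives as literal text and is not treated as markup.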
