Parsing HTML to get content using C#

半阙折子戏 2020-11-29 09:33

I am writing an application that crawls a group of my web pages. Rather than take the entire source code of the page, I'd like to take all of the content and store that and

4 answers
  •  再見小時候
    2020-11-29 10:30

    The function below removes all HTML tags, scripts, and CSS styles from an HTML string and converts it to plain text.

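    using System.Text.RegularExpressions;

    private string GetPlainTextFromHtml(string htmlString)
    {
        string htmlTagPattern = "<.*?>";
        // Remove <script> and <style> elements together with their contents.
        var regexCss = new Regex(@"(<script.+?</script>)|(<style.+?</style>)", RegexOptions.Singleline | RegexOptions.IgnoreCase);
        htmlString = regexCss.Replace(htmlString, string.Empty);
        // Strip the remaining tags.
        htmlString = Regex.Replace(htmlString, htmlTagPattern, string.Empty);
        // Drop lines that contain only whitespace.
        htmlString = Regex.Replace(htmlString, @"^\s+$[\r\n]*", "", RegexOptions.Multiline);
        // Turn non-breaking-space entities into ordinary spaces.
        htmlString = htmlString.Replace("&nbsp;", " ");

        return htmlString;
    }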
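
    As a rough usage sketch of the regex approach above, the console program below runs a similar cleanup on a small sample page. The class name HtmlTextDemo, the helper ToPlainText, and the sample markup are all hypothetical, and the call to WebUtility.HtmlDecode is an addition that is not part of the answer's function. Regex-based stripping can misbehave on malformed markup, so a dedicated HTML parser may be the safer choice for pages you do not control.

    ```csharp
    using System;
    using System.Net;
    using System.Text.RegularExpressions;

    public static class HtmlTextDemo
    {
        // Hypothetical helper: strips markup with regular expressions only.
        public static string ToPlainText(string html)
        {
            // Drop <script> and <style> elements together with their contents.
            string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", string.Empty,
                RegexOptions.Singleline | RegexOptions.IgnoreCase);
            // Replace each remaining tag with a space so adjacent words do not merge.
            text = Regex.Replace(text, @"<[^>]+>", " ");
            // Decode character entities such as &nbsp; and &amp;.
            text = WebUtility.HtmlDecode(text);
            // Collapse runs of whitespace into single spaces and trim the ends.
            return Regex.Replace(text, @"\s+", " ").Trim();
        }

        public static void Main()
        {
            // Assumed sample input; a crawler would pass the downloaded page source instead.
            string sample = "<html><head><style>p { color: red; }</style></head>" +
                            "<body><h1>Title</h1><p>Hello&nbsp;<b>world</b> &amp; more</p>" +
                            "<script>alert('x');</script></body></html>";
            Console.WriteLine(ToPlainText(sample));
        }
    }
    ```

    Replacing tags with a space instead of an empty string keeps text from neighbouring elements from running together; the final whitespace pass then tidies up the extra spaces this leaves behind.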
    
