How to read the Website content in c#?

浪子不回头ぞ 提交于 2019-12-01 04:13:28

Here is how you would do it using the HtmlAgilityPack.

First your sample HTML:

var html = "<html>\r\n<body>\r\nbla bla </td><td>\r\nbla bla \r\n<body>\r\n<html>";

Load it up (as a string in this case):

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

If getting it from the web, similar:

var web = new HtmlWeb();
var doc = web.Load(url);

Now select only text nodes with non-whitespace and trim them.

var text = doc.DocumentNode.Descendants()
              .Where(x => x.NodeType == HtmlNodeType.Text && x.InnerText.Trim().Length > 0)
              .Select(x => x.InnerText.Trim());

You can get this as a single joined string if you like:

String.Join(" ", text)

Of course this will only work for simple web pages. Anything complex will also return nodes with data you clearly don't want, such as javascript functions etc.

Tigran

You need to use special HTML parser. The only way to get the content of the such non regular language.

See: What is the best way to parse html in C#?

public string GetwebContent(string urlForGet)
{
    // Create WebClient
    var client = new WebClient();
    // Download Text From web
    var text = client.DownloadString(urlForGet);
    return text.ToString();
}

I think this link can help you.

/// <summary>
/// Remove HTML tags from string using char array.
/// </summary>
public static string StripTagsCharArray(string source)
{
char[] array = new char[source.Length];
int arrayIndex = 0;
bool inside = false;

for (int i = 0; i < source.Length; i++)
{
    char let = source[i];
    if (let == '<')
    {
    inside = true;
    continue;
    }
    if (let == '>')
    {
    inside = false;
    continue;
    }
    if (!inside)
    {
    array[arrayIndex] = let;
    arrayIndex++;
    }
}
return new string(array, 0, arrayIndex);
}
// Reading Web page content in c# program
//Specify the Web page to read
WebRequest request = WebRequest.Create("http://aspspider.info/snallathambi/default.aspx");
//Get the response
WebResponse response = request.GetResponse(); 
//Read the stream from the response
StreamReader reader = new StreamReader(response.GetResponseStream()); 
//Read the text from stream reader
string str = reader.ReadLine();
for(int i=0;i<200;i++)
{
   str += reader.ReadLine();

}

Console.Write(str);
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!