I want to read a website's text without the HTML tags and headers. I just need the text that is displayed in the web browser.
I don't want output like this:
<html>
<body>
bla bla </td><td>
bla bla
<body>
<html>
I just need the text "bla bla bla bla".
I have used the WebClient and HttpWebRequest classes to get the HTML content and then tried splitting the received data, but that does not work reliably, because the tags differ from one website to another.
So is there any way to get only the text displayed on the website programmatically?
Here is how you would do it using the HtmlAgilityPack.
First your sample HTML:
var html = "<html>\r\n<body>\r\nbla bla </td><td>\r\nbla bla \r\n<body>\r\n<html>";
Load it up (as a string in this case):
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
If you are getting it from the web instead, the code is similar:
var web = new HtmlWeb();
var doc = web.Load(url);
Now select only text nodes with non-whitespace and trim them.
var text = doc.DocumentNode.Descendants()
    .Where(x => x.NodeType == HtmlNodeType.Text && x.InnerText.Trim().Length > 0)
    .Select(x => x.InnerText.Trim());
You can get this as a single joined string if you like:
String.Join(" ", text)
Of course, this will only work for simple web pages. On anything complex it will also return text nodes containing data you clearly don't want, such as the bodies of JavaScript functions.
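One way to reduce that noise is to skip text nodes whose parent is a script or style element. The following is only a minimal sketch under that assumption: the class name VisibleText, the method name Extract, and the skip list are hypothetical and not part of HtmlAgilityPack. It also decodes entities such as &amp; with HtmlEntity.DeEntitize.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class VisibleText
{
    // Hypothetical skip list (an assumption): elements whose text is not shown to the user.
    private static readonly HashSet<string> Skipped =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "script", "style" };

    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var parts = doc.DocumentNode.Descendants()
            // Keep text nodes only (comments and elements are dropped).
            .Where(n => n.NodeType == HtmlNodeType.Text)
            // Drop text that sits directly inside <script> or <style>.
            .Where(n => n.ParentNode == null || !Skipped.Contains(n.ParentNode.Name))
            // Decode entities, then trim surrounding whitespace.
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0);

        return string.Join(" ", parts);
    }
}
```

This still cannot know what a browser would actually render: elements hidden through CSS, or text generated by scripts at run time, are outside what a static parse can see.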
You need to use a dedicated HTML parser. That is the only reliable way to extract content from a non-regular language such as HTML.
// Requires: using System.Net;
public string GetwebContent(string urlForGet)
{
    // Create the WebClient and dispose of it when done
    using (var client = new WebClient())
    {
        // Download the raw page source (HTML markup included) as a string
        return client.DownloadString(urlForGet);
    }
}
I think this link can help you.
/// <summary>
/// Remove HTML tags from string using char array.
/// </summary>
public static string StripTagsCharArray(string source)
{
    char[] array = new char[source.Length];
    int arrayIndex = 0;
    bool inside = false;
    for (int i = 0; i < source.Length; i++)
    {
        char let = source[i];
        if (let == '<')
        {
            inside = true;
            continue;
        }
        if (let == '>')
        {
            inside = false;
            continue;
        }
        if (!inside)
        {
            array[arrayIndex] = let;
            arrayIndex++;
        }
    }
    return new string(array, 0, arrayIndex);
}
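Stripping tags this way leaves the original line breaks and any HTML entities in place. As a rough sketch of one way to tidy the result, the variant below applies the same character scan and then decodes entities and collapses whitespace. The names TagStripper and ToPlainText are hypothetical, and adding a space at each tag boundary is an assumption made so that adjacent table cells do not run together.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class TagStripper
{
    public static string ToPlainText(string source)
    {
        var sb = new StringBuilder(source.Length);
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>')
            {
                inside = false;
                // Assumption: treat every tag as a word boundary.
                sb.Append(' ');
                continue;
            }
            if (!inside) sb.Append(c);
        }

        // Turn entities such as &amp; into the characters they stand for.
        string decoded = WebUtility.HtmlDecode(sb.ToString());

        // Collapse runs of whitespace (including line breaks) into single spaces.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Called on the sample HTML from the question, this should give the single line "bla bla bla bla". The trade-off is that inline markup inside a word is split by a space, and script or style content is still kept, so a real HTML parser remains the safer choice for arbitrary pages.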
// Reading web page content in a C# program
// Requires: using System; using System.IO; using System.Net;
// Specify the web page to read
WebRequest request = WebRequest.Create("http://aspspider.info/snallathambi/default.aspx");
// Get the response and open a reader on its stream, disposing of both when done
using (WebResponse response = request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    // Read the whole body; a fixed number of ReadLine calls would truncate long pages and drop line breaks
    string str = reader.ReadToEnd();
    Console.Write(str);
}
Source: https://stackoverflow.com/questions/10579292/how-to-read-the-website-content-in-c