Pulling data from a webpage, parsing it for specific pieces, and displaying it

Submitted by 我只是一个虾纸丫 on 2019-11-26 17:21:14

This small example uses HtmlAgilityPack with XPath selectors to reach the desired elements.

protected void Page_Load(object sender, EventArgs e)
{
    string url = "http://www.metacritic.com/game/pc/halo-spartan-assault";
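    var web = new HtmlAgilityPack.HtmlWeb();
    // Fully qualified so the snippet compiles without a "using HtmlAgilityPack;" directive.
    HtmlAgilityPack.HtmlDocument doc = web.Load(url);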

    string metascore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
    string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
    string summary = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText;
}

An easy way to obtain the XPath for a given element is to use your web browser's Developer Tools (I use Chrome):

  • Open the Developer Tools (F12 or Ctrl + Shift + C on Windows, or Command + Shift + C on Mac).
  • Select the element in the page that you want the XPath for.
  • Right click the element in the "Elements" tab.
  • Click on "Copy as XPath".

You can paste it exactly like that into C# (as shown in my code), but make sure to escape the quotes.
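As a small illustration of the quote escaping, here is a sketch of three ways to write the same kind of XPath in a C# string. The class and constant names are made up for this example, and the XPath is shortened from the one in the answer.

```csharp
public static class XPathQuoting
{
    // Regular string literal: each inner double quote is escaped with a backslash.
    public const string Escaped = "//*[@id=\"main\"]/div[3]";

    // Verbatim string literal: inner double quotes are doubled instead.
    public const string Verbatim = @"//*[@id=""main""]/div[3]";

    // XPath also accepts single-quoted string literals, so nothing needs escaping.
    public const string SingleQuoted = "//*[@id='main']/div[3]";
}
```

The first two constants contain exactly the same characters; the third is a different string that XPath treats as an equivalent expression.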

Make sure you add error handling, because web scraping breaks whenever the site changes the HTML structure of the page.
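One possible way to guard against a changed layout, sketched under the assumption that a missing element should yield null and not an exception. SafeScrape and TryGetText are hypothetical names; the HtmlAgilityPack calls used are LoadHtml, SelectSingleNode, and HtmlEntity.DeEntitize.

```csharp
using HtmlAgilityPack;

public static class SafeScrape
{
    // Returns the trimmed, entity-decoded text of the first node matching the
    // XPath, or null when nothing matches (for example after a site redesign).
    public static string TryGetText(HtmlDocument doc, string xpath)
    {
        // SelectSingleNode returns null when there is no match, so check before use.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    // Convenience overload that parses an HTML string first.
    public static string TryGetText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return TryGetText(doc, xpath);
    }
}
```

A caller can then test the result for null and show a fallback value, so a layout change does not crash the page with a NullReferenceException.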

Edit

Per @knocte's suggestion, here is the link to the NuGet package for HtmlAgilityPack:

https://www.nuget.org/packages/HtmlAgilityPack/

I looked, and Metacritic.com doesn't have an API.

You can use an HttpWebRequest to get the contents of a website as a string.

using System;
using System.Net;
using System.IO;
using System.Text;            // needed for Encoding.UTF8
using System.Windows.Forms;   // needed for MessageBox

string result = null;
string url = "http://www.stackoverflow.com";
WebResponse response = null;
StreamReader reader = null;

try
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "GET";
    response = request.GetResponse();
    reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
    result = reader.ReadToEnd();
}
catch (Exception ex)
{
    // handle error
    MessageBox.Show(ex.Message);
}
finally
{
    if (reader != null)
        reader.Close();
    if (response != null)
        response.Close();
}

Then you can parse the string for the data that you want by taking advantage of Metacritic's use of meta tags. Here's the information they have available in meta tags:

  • og:title
  • og:type
  • og:url
  • og:image
  • og:site_name
  • og:description

The format of each tag is: <meta name="og:title" content="In a World...">
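A minimal sketch of pulling those tags out of the downloaded string with a regular expression. MetaTagReader and ReadOpenGraph are hypothetical names, and the pattern assumes double-quoted attributes with name (or property) appearing before content. A regex is fragile against arbitrary HTML, so treat this as a quick approach for this one tag format and not as a general parser.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class MetaTagReader
{
    // Matches <meta name="og:xxx" content="..."> and the property="og:xxx" variant.
    private static readonly Regex OgTag = new Regex(
        "<meta\\s+(?:name|property)=\"(?<key>og:[^\"]+)\"\\s+content=\"(?<value>[^\"]*)\"",
        RegexOptions.IgnoreCase);

    // Returns a map such as "og:title" -> "In a World...".
    public static Dictionary<string, string> ReadOpenGraph(string html)
    {
        var tags = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in OgTag.Matches(html))
        {
            // Decode entities such as &amp; in the attribute value.
            tags[m.Groups["key"].Value] = WebUtility.HtmlDecode(m.Groups["value"].Value);
        }
        return tags;
    }
}
```

Passing the result string from the request above to ReadOpenGraph gives a dictionary that can be checked with TryGetValue for keys such as "og:description".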

Jason Goemaat

I recommend Dcsoup. There's a NuGet package for it, and it uses CSS selectors, so it feels familiar if you use jQuery. I've tried others, but it is the best and easiest to use that I've found. There's not much documentation, but it's open source and a port of the Java jsoup library, which has good documentation. (Documentation for the .NET API is also available.) I absolutely love it.

var timeoutInMilliseconds = 5000;
var uri = new Uri("http://www.metacritic.com/game/pc/fallout-4");
var doc = Supremes.Dcsoup.Parse(uri, timeoutInMilliseconds);

// <span itemprop="ratingValue">86</span>
var ratingSpan = doc.Select("span[itemprop=ratingValue]");
int ratingValue = int.Parse(ratingSpan.Text);

// selectors match both critic and user scores
var scoreDiv = doc.Select("div.score_summary");
var scoreAnchor = scoreDiv.Select("a.metascore_anchor");
int criticRating = int.Parse(scoreAnchor[0].Text);
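// Invariant culture so "8.1" parses the same on machines that use a decimal comma.
float userRating = float.Parse(scoreAnchor[1].Text, System.Globalization.CultureInfo.InvariantCulture);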
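Parsing the scraped text with int.Parse or float.Parse throws if the page shows something other than a number. A hedged sketch of a more forgiving approach follows. ScoreParser and its methods are hypothetical names, and the "tbd" placeholder mentioned in the comment is an assumption about what the site might display when no score exists.

```csharp
using System.Globalization;

public static class ScoreParser
{
    // Parses a critic score such as "86"; returns null for non-numeric text
    // (for example a placeholder like "tbd", which is an assumed value).
    public static int? ParseCritic(string text)
    {
        int value;
        return int.TryParse((text ?? "").Trim(), NumberStyles.Integer,
                            CultureInfo.InvariantCulture, out value)
            ? value : (int?)null;
    }

    // Parses a user score such as "8.1" regardless of the machine's decimal separator.
    public static float? ParseUser(string text)
    {
        float value;
        return float.TryParse((text ?? "").Trim(), NumberStyles.Float,
                              CultureInfo.InvariantCulture, out value)
            ? value : (float?)null;
    }
}
```

With this, ScoreParser.ParseCritic(scoreAnchor[0].Text) yields a nullable value that the caller can check with HasValue before displaying it.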