How to make asynchronous calls using HtmlAgilityPack?

浪子不回头ぞ 提交于 2019-12-06 05:53:25

The main issue with this page is that content inside table-matches is loaded via ajax. And neither HttpClient nor HtmlAgilityPack unable to wait for ajax to be executed. Therefore, you need different approach.

Approach #1 - Use any headless browser like PuppeteerSharp

using PuppeteerSharp;
using System;
using System.Threading.Tasks;

namespace PuppeteerSharpDemo
{
    class Program
    {
        private static String url = "http://www.oddsportal.com/matches/soccer/20180701/";

        static void Main(string[] args)
        {
            var htmlAsTask = LoadAndWaitForSelector(url, "#table-matches .table-main");
            htmlAsTask.Wait();
            Console.WriteLine(htmlAsTask.Result);

            Console.ReadKey();
        }

        public static async Task<string> LoadAndWaitForSelector(String url, String selector)
        {
            var browser = await Puppeteer.LaunchAsync(new LaunchOptions
            {
                Headless = true,
                ExecutablePath = @"c:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
            });
            using (Page page = await browser.NewPageAsync())
            {
                await page.GoToAsync(url);
                await page.WaitForSelectorAsync(selector);
                return await page.GetContentAsync();
            }
        }
    }
}

In purpose of cleanness, I've posted output here here. And once you get html content you are able to parse it with HtmlAgilityPack.

Approach #2 - Use pure Selenium WebDriver. Can be launched in headless mode.

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;
using System;

namespace SeleniumDemo
{
    class Program
    {
        private static IWebDriver webDriver;
        private static TimeSpan defaultWait = TimeSpan.FromSeconds(10);
        private static String targetUrl = "http://www.oddsportal.com/matches/soccer/20180701/";
        private static String driversDir = @"../../Drivers/";

        static void Main(string[] args)
        {
            webDriver = new ChromeDriver(driversDir);
            webDriver.Navigate().GoToUrl(targetUrl);
            IWebElement table = webDriver.FindElement(By.Id("table-matches"));
            var innerHtml = table.GetAttribute("innerHTML");
        }

        #region (!) I didn't even use this, but it can be useful (!)
        public static IWebElement FindElement(By by)
        {
            try
            {
                WaitForAjax();
                var wait = new WebDriverWait(webDriver, defaultWait);
                return wait.Until(driver => driver.FindElement(by));
            }
            catch
            {
                return null;
            }
        }

        public static void WaitForAjax()
        {
            var wait = new WebDriverWait(webDriver, defaultWait);
            wait.Until(d => (bool)(d as IJavaScriptExecutor).ExecuteScript("return jQuery.active == 0"));
        }
        #endregion
    }
}

Approach #3 - Simulate ajax requests

If you analyse the page loading using Fiddler or browser's profiler (F12) you can see that all data is coming with these two requests:

So you can try to execute them directly using HttpClient. But in this case you may need to track authorization headers and maybe something else with each HTTP request.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!