Can HtmlAgilityPack handle an xml file that comes with an xsl file to render html?

三世轮回 提交于 2019-12-04 08:00:35

Html Agility Pack can help you here on two points:

1) it's easier to get an Xml processing instruction with it as it parses the PI data as Html, so it will transform it into attributes

2) HtmlDocument implements IXPathNavigable so it can be transformed directly by the .NET Xslt transformation engine.

Here is a piece of code that works. I had to add a specific XmlResover to handle Xslt transform properly, but I think this is specific to this skechers case.

public static void DownloadAndProcessXml(string url, string userAgent, string outputFilePath)
{
    using (XmlTextWriter writer = new XmlTextWriter(outputFilePath, Encoding.UTF8))
    {
        DownloadAndProcessXml(url, userAgent, writer);
    }
}

public static void DownloadAndProcessXml(string url, string userAgent, XmlWriter output)
{
    UserAgentXmlUrlResolver resolver = new UserAgentXmlUrlResolver(url, userAgent);

    // WebClient is an easy to use class.
    using (WebClient client = new WebClient())
    {
        // download Xml doc. set User-Agent header or the site won't answer us...
        client.Headers[HttpRequestHeader.UserAgent] = resolver.UserAgent;
        HtmlDocument xmlDoc = new HtmlDocument();
        xmlDoc.Load(client.OpenRead(url));

        // determine xslt (note the xpath trick as Html Agility Pack does not support xml processing instructions)
        string xsltUrl = xmlDoc.DocumentNode.SelectSingleNode("//*[name()='?xml-stylesheet']").GetAttributeValue("href", null);

        // download Xslt doc
        client.Headers[HttpRequestHeader.UserAgent] = resolver.UserAgent;
        XslCompiledTransform xslt = new XslCompiledTransform();
        xslt.Load(new XmlTextReader(client.OpenRead(url + xsltUrl)), new XsltSettings(true, false), null);

        // transform Html/Xml doc into new Xml doc, easy as HtmlDocument implements IXPathNavigable
        // note the use of a custom resolver to overcome this Xslt resolve requests
        xslt.Transform(xmlDoc, null, output, resolver);
    }
}

// This class is needed during transformation otherwise there are errors.
// This is probably due to this very specific Xslt file that needs to go back to the root document itself.
public class UserAgentXmlUrlResolver : XmlUrlResolver
{
    public UserAgentXmlUrlResolver(string rootUrl, string userAgent)
    {
        RootUrl = rootUrl;
        UserAgent = userAgent;
    }

    public string RootUrl { get; set; }
    public string UserAgent { get; set; }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        WebClient client = new WebClient();
        if (!string.IsNullOrEmpty(UserAgent))
        {
            client.Headers[HttpRequestHeader.UserAgent] = UserAgent;
        }
        return client.OpenRead(absoluteUri);
    }

    public override Uri ResolveUri(Uri baseUri, string relativeUri)
    {
        if ((relativeUri == "/") && (!string.IsNullOrEmpty(RootUrl)))
            return new Uri(RootUrl);

        return base.ResolveUri(baseUri, relativeUri);
    }
}

And you call it like this:

    string url = "http://www.skechers.com/";
    string ua = @"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5";
    DownloadAndProcessXml(url, ua, "skechers.html");
Brian Lyttle

You should render the output of the XML and XSLT. To do this you need to download the XML, and you've already done that. Next parse the XML to identify the XSL reference. Then you need to download the XSL and apply that to the XML document.

These links may be useful

Here is the additional code I ended up using once I received the response. Please note that this is only good if the response is "application/xml" and you will have to check for null instances of objects throughout. Also, FormAssetSrc is a private function that takes the value of the href and determines whether it is protocol, root, or document relative and creates the fully qualified uri.

var xmlStream = response.GetResponseStream();
var xmlDocument = new XPathDocument(xmlStream);
var styleNode = xmlDocument.CreateNavigator().SelectSingleNode("processing-instruction('xml-stylesheet')");
var hrefValue = Regex.Match((styleNode).Value, "href=(\"|')(?<url>.*?)(\"|')");
if(hrefValue.Success)
{
    var xslHref = FormAssetSrc(hrefValue.Groups["url"].Value, response.ResponseUri);
    var xslUri = new Uri(xslHref);
    var xslRequest = CreateWebRequest(xslUri);
    var xslResponse = (HttpWebResponse)xslRequest.GetResponse();
    var xslStream = new XPathDocument(xslResponse.GetResponseStream());
    var xslTransorm = new XslTransform();
    var sw = new System.IO.StringWriter();
    xslTransorm.Load(xslStream);
    xslTransorm.Transform(xmlDocument.CreateNavigator(), null, sw);
    page.Html.LoadHtml(sw.ToString());
}
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!