Parse HTML using an Ant Script

江枫思渺然 submitted on 2019-12-17 16:40:37

Question


I need to retrieve some values from an HTML file. I need to use Ant so I can use these values in other parts of my script.

Can this even be achieved in Ant?


Answer 1:


As stated in the other answers, you can't do this in "pure" XML. You need to embed a programming language. My personal favourite is Groovy; its integration with ANT is excellent.

Here's a sample that retrieves the logo URL from the Groovy homepage:

parse:

print:
     [echo] 
     [echo]         Logo URL: http://groovy.codehaus.org/images/groovy-logo-medium.png
     [echo]     

build.xml

The build uses the Ivy plug-in to retrieve all third-party dependencies.

<project name="demo" default="print" xmlns:ivy="antlib:org.apache.ivy.ant">

    <target name="resolve">
        <ivy:resolve/>
        <ivy:cachepath pathid="build.path" conf="build"/>
    </target>

    <target name="parse" depends="resolve">
        <taskdef name="groovy" classname="org.codehaus.groovy.ant.Groovy" classpathref="build.path"/>

        <groovy>
        import org.htmlcleaner.*

        def address = 'http://groovy.codehaus.org/'

        // Clean any messy HTML
        def cleaner = new HtmlCleaner()
        def node = cleaner.clean(address.toURL())

        // Convert from HTML to XML
        def props = cleaner.getProperties()
        def serializer = new SimpleXmlSerializer(props)
        def xml = serializer.getXmlAsString(node)

        // Parse the XML into a document we can work with
        def page = new XmlSlurper(false,false).parseText(xml)

        // Retrieve the logo URL
        properties["logo"] = page.body.div[0].div[1].div[0].div[0].div[0].img.@src
        </groovy>
    </target>

    <target name="print" depends="parse">
        <echo>
        Logo URL: ${logo}
        </echo>
    </target>

</project>

The parsing logic is pure groovy programming. I love the way you can easily walk the page's DOM tree:

// Retrieve the logo URL
properties["logo"] = page.body.div[0].div[1].div[0].div[0].div[0].img.@src

ivy.xml

Ivy is similar to Maven. It manages your dependencies on third-party software. Here it's being used to pull down Groovy and the HtmlCleaner library that the Groovy logic uses:

<ivy-module version="2.0">
    <info organisation="org.myspotontheweb" module="demo"/>
    <configurations defaultconfmapping="build->default">
        <conf name="build" description="ANT tasks"/>
    </configurations>
    <dependencies>
        <dependency org="org.codehaus.groovy" name="groovy-all" rev="1.8.2"/>
        <dependency org="net.sourceforge.htmlcleaner" name="htmlcleaner" rev="2.2"/>
    </dependencies>
</ivy-module>

How to install ivy

Ivy is a standard ANT plugin. Download its jar and place it in one of the following directories:

$HOME/.ant/lib
$ANT_HOME/lib

I don't know why the ANT project doesn't ship with ivy.




Answer 2:


Yes, this is very possible.

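Note that this solution relies on the JavaScript engine bundled with the JRE, so your JAVA_HOME variable must point to JRE 1.6 or later. (JDK 15 and later no longer bundle a JavaScript engine, so one has to be added to Ant's classpath there.)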

<project name="extractElement" default="test">
<!--Extract element from html file-->
<scriptdef name="findelement" language="javascript">
     <attribute name="tag" />
     <attribute name="file" />
     <attribute name="property" />
     <![CDATA[
       var tag = attributes.get("tag");
       var file = attributes.get("file");
       var regex = "<" + tag + "[^>]*>(.*?)</" + tag + ">";
       var patt = new RegExp(regex,"g");
       project.setProperty(attributes.get("property"), patt.exec(file));
     ]]>
</scriptdef>

<!--Only available target...-->
<target name="test">
    <!--Load html file into property-->
    <loadfile srcFile="D:\Tools\CruiseControl\Build\artifacts\RECO\20110831100942\RECO_merged_report.html" property="html.file"/>
    <!--Find element with specific tag and save it to property element-->
    <findelement tag="title" file="${html.file}" property="element"/>
    <echo message="File : ${html.file}"/>
    <echo message="Title : ${element}"/>
</target>
</project>

Output : [echo] Title : <title>Test Report</title>,Test Report

Since I don't know exactly which values you are looking for, this solution finds the first element matching the tag you specify in the tag attribute and returns both the whole element and its inner text, as the output above shows. Of course, you could modify the regex to suit your own needs.

Also, this is a pure Ant build.xml with no external dependencies whatsoever.
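As a rough sketch of how the regex in the findelement scriptdef might be adapted, the plain JavaScript below collects only the inner text of every matching element rather than the raw result of a single exec() call. The function name findElements and the sample HTML string are hypothetical and not part of the original answer. A regex is still not a real HTML parser, so nested or malformed tags will trip it up.

```javascript
// Hypothetical helper (not from the original answer): returns the inner text
// of every <tag>...</tag> pair found in the given HTML string.
function findElements(html, tag) {
  // [\s\S]*? also matches newlines, which "." does not;
  // \b stops "h1" from matching "<h10>"; "i" ignores the case of the tag name.
  var pattern = new RegExp("<" + tag + "\\b[^>]*>([\\s\\S]*?)</" + tag + ">", "gi");
  var results = [];
  var match;
  // exec() returns [whole match, capture group 1]; keep only group 1.
  while ((match = pattern.exec(html)) !== null) {
    // Strip leading and trailing whitespace without relying on String.trim().
    results.push(match[1].replace(/^\s+|\s+$/g, ""));
  }
  return results;
}

// Made-up sample input for illustration only.
var sample = "<html><head><title>\n  Test Report\n</title></head>" +
             "<body><h1>First</h1><H1 class=\"x\">Second</H1></body></html>";

var titles = findElements(sample, "title");
var headings = findElements(sample, "h1");
```

Inside the scriptdef, the last line could then become something like project.setProperty(attributes.get("property"), findElements(String(file), String(tag)).join(",")), assuming the function is pasted into the same CDATA block. For a file with a single title element, the property should then hold only the inner text instead of the element followed by its text.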




Answer 3:


Sure, but you have to write your own task for it. Visit http://ant.apache.org/manual/develop.html#writingowntask for more information about writing your own tasks for Ant. In your Ant task you can parse your HTML file as needed.

I claim that what you want is not directly possible with "pure" XML (build.xml).




Answer 4:


Take a look at the xmlproperty task (http://ant.apache.org/manual/Tasks/xmlproperty.html) and see if it'll work for you. It's pretty straightforward:

<xmlproperty file="${html.file}"
   prefix="html."/>

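This works only when the file is well-formed XML (XHTML, for example); HTML in general is not a subset of XML, so unclosed tags or unquoted attributes will make the task fail. I've used it before to do this very task. No need to write your own task or script.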



Source: https://stackoverflow.com/questions/7428855/parse-html-using-with-an-ant-script
