How to get everything between two HTML tags? (with XPath?)

问题

EDIT : I've added a solution which works in this case.

I want to extract a table from a page and I want to do this (probably) with a DOMDocument and XPath. But if you've got a better idea, tell me.

My first attempt was this (obviously faulty, because it will get the first closing table tag):

<?php 
    $tableStart = strpos($source, '<table class="schedule"');
    $tableEnd   = strpos($source, '</table>', $tableStart);
    $rawTable   = substr($source, $tableStart, ($tableEnd - $tableStart));
?>

I tough, this might be solvable with a DOMDocument and/or xpath...

In the end I want everything between the tags (in this case, the tags), and the tags them self. So all HTML, not just the values (e.g. Not just 'Value' but 'Value'). And there is one 'catch'...

The table has in it, other tables. So if you just search for the end of the table (' tag') you get probably the wrong tag.
The opening tag has a class with which you can identify it (classname = 'schedule').

Is this possible?

This is the (simplified) source piece that I want to extract from another website: (I also want to display the html tags, not just the values, so the whole table with the class 'schedule')

<table class="schedule">
    <table class="annoying nested table">
        Lots of table rows, etc.
    </table> <-- The problematic tag...
    <table class="annoying nested table">
        Lots of table rows, etc.
    </table> <-- The problematic tag...
    <table class="annoying nested table">
        Lots of table rows, etc.
    </table> <-- a problematic tag...

    This could even be variable content. =O =S

</table>

回答1:

First of all, do note that XPath is based on the XML Infopath -- a model of XML where there are no "starting tag" and "ending tag" bu there are only nodes

Therfore, one shouldn't expect an XPath expression to select "tags" -- it selects nodes.

Taking this fact into account, I interpret the question as:

I want to obtain the set of all elements that are between a given "start" element and a given "end element", including the start and end elements.

In XPath 2.0 this can be done conveniently with the standard operator intersect.

In XPath 1.0 (which I assume you are using) this is not so easy. The solution is to use the Kayessian (by @Michael Kay) formula for node-set intersection:

The intersection of two node-sets: $ns1 and $ns2 is selected by evaluating the following XPath expression:

$ns1[count(.|$ns2) = count($ns2)]

Let's assume that we have the following XML document (as you never provided one):

<html>
    <body>
        <table>
            <tr valign="top">
                <td>
                    <table class="target">
                        <tr>
                            <td>Other Node</td>
                            <td>Other Node</td>
                            <td>Starting Node</td>
                            <td>Inner Node</td>
                            <td>Inner Node</td>
                            <td>Inner Node</td>
                            <td>Ending Node</td>
                            <td>Other Node</td>
                            <td>Other Node</td>
                            <td>Other Node</td>
                        </tr>
                    </table>
                </td>
            </tr>
        </table>
    </body>
</html>

The start-element is selected by:

//table[@class = 'target']
         //td[. = 'Starting Node']

The end-element is selected by:

//table[@class = 'target']
         //td[. = Ending Node']

To obtain all wanted nodes we intersect the following two sets:

The set consisting of the start elementand all following elements (we name this $vFollowing).
The set consisting of the end element and all preceding elements (we name this $vPreceding).

These are selected, respectively by the following XPath expressions:

$vFollowing:

$vStartNode | $vStartNode/following::*

$vPreceding:

$vEndNode | $vEndNode/preceding::*

Now we can simply apply the Kayessian formula on the nodesets $vFollowing and $vPreceding:

       $vFollowing
          [count(.|$vPreceding)
          =
           count($vPreceding)
          ]

What remains is to substitute all variables with their respective expressions.

XSLT - based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:variable name="vStartNode" select=
 "//table[@class = 'target']//td[. = 'Starting Node']"/>

 <xsl:variable name="vEndNode" select=
 "//table[@class = 'target']//td[. = 'Ending Node']"/>

 <xsl:variable name="vFollowing" select=
 "$vStartNode | $vStartNode/following::*"/>

 <xsl:variable name="vPreceding" select=
 "$vEndNode | $vEndNode/preceding::*"/>

 <xsl:template match="/">
      <xsl:copy-of select=
          "$vFollowing
              [count(.|$vPreceding)
              =
               count($vPreceding)
              ]"/>
 </xsl:template>
</xsl:stylesheet>

when applied on the XML document above, the XPath expressions are evaluated and the wanted, correct resulting-selected node-set is output:

<td>Starting Node</td>
<td>Inner Node</td>
<td>Inner Node</td>
<td>Inner Node</td>
<td>Ending Node</td>

回答2:

Do not use regexes (or strpos...) to parse HTML!

Part of why this problem was difficult for you is you are thinking in "tags" instead of "nodes" or "elements". Tags are an artifact of serialization. (HTML has optional end tags.) Nodes are the actual data structure. A DOMDocument has no "tags", only "nodes" arranged in the proper tree structure.

Here is how you get your table with XPath:

// This is a simple solution, but only works if the value of "class" attribute is exactly "schedule"
// $xpath = '//table[@class="schedule"]';

// This is what you want. It is equivalent to the "table.schedule" css selector:
$xpath = "//table[contains(concat(' ',normalize-space(@class),' '),' schedule ')]";

$d = new DOMDocument();
$d->loadHTMLFile('http://example.org');
$xp = new DOMXPath($d);
$tables = $xp->query($xpath);
foreach ($tables as $table) {
    $table; // this is a DOMElement of a table with class="schedule"; It includes all nodes which are children of it.
}

回答3:

If you have well formed HTML like this

<html>
<body>
    <table>
        <tr valign='top'>
            <td>
                <table class='inner'>
                    <tr><td>Inner Table</td></tr>
                </table>
            </td>
            <td>
                <table class='second inner'>
                    <tr><td>Second  Inner</td></tr>
                </table>
            </td>
        </tr>
    </table>
</body>
</html>

Output the nodes (in an xml wrapper) with this pho code

<?php
    $xml = new DOMDocument();
    $strFileName = "t.xml";
    $xml->load($strFileName);

    $xmlCopy = new DOMDocument();
    $xmlCopy->loadXML( "<xml/>" ); 

    $xpath = new domxpath( $xml );
    $strXPath = "//table[@class='inner']";

    $elements = $xpath->query( $strXPath, $xml );
    foreach( $elements as $element ) {
        $ndTemp = $xmlCopy->importNode( $element, true );
        $xmlCopy->documentElement->appendChild( $ndTemp );
    }
    echo $xmlCopy->saveXML();
?>

回答4:

This gets the whole table. But it can be modified to let it grab another tag. This is quite a case specific solution which can only be used onder specific circumstances. Breaks if html, php or css comments containt the opening or closing tag. Use it with caution.

Function:

// **********************************************************************************
// Gets a whole html tag with its contents.
//  - Source should be a well formatted html string (get it with file_get_contents or cURL)
//  - You CAN provide a custom startTag with in it e.g. an id or something else (<table style='border:0;')
//    This is recommended if it is not the only p/table/h2/etc. tag in the script.
//  - Ignores closing tags if there is an opening tag of the same sort you provided. Got it?
function getTagWithContents($source, $tag, $customStartTag = false)
{

    $startTag = '<'.$tag;
    $endTag   = '</'.$tag.'>';

    $startTagLength = strlen($startTag);
    $endTagLength   = strlen($endTag);

//      ***************************** 
    if ($customStartTag)
        $gotStartTag = strpos($source, $customStartTag);
    else
        $gotStartTag = strpos($source, $startTag);

    // Can't find it?
    if (!$gotStartTag)
        return false;       
    else
    {

//      ***************************** 

        // This is the hard part: finding the correct closing tag position.
        // <table class="schedule">
        //     <table>
        //     </table> <-- Not this one
        // </table> <-- But this one

        $foundIt          = false;
        $locationInScript = $gotStartTag;
        $startPosition    = $gotStartTag;

        // Checks if there is an opening tag before the start tag.
        while ($foundIt == false)
        {
            $gotAnotherStart = strpos($source, $startTag, $locationInScript + $startTagLength);
            $endPosition        = strpos($source, $endTag,   $locationInScript + $endTagLength);

            // If it can find another opening tag before the closing tag, skip that closing tag.
            if ($gotAnotherStart && $gotAnotherStart < $endPosition)
            {               
                $locationInScript = $endPosition;
            }
            else
            {
                $foundIt  = true;
                $endPosition = $endPosition + $endTagLength;
            }
        }

//      ***************************** 

        // cut the piece from its source and return it.
        return substr($source, $startPosition, ($endPosition - $startPosition));

    } 
}

Application of the function:

$gotTable = getTagWithContents($tableData, 'table', '<table class="schedule"');
if (!$gotTable)
{
    $error = 'Faild to log in or to get the tag';
}
else
{
    //Do something you want to do with it, e.g. display it or clean it...
    $cleanTable = preg_replace('|href=\'(.*)\'|', '', $gotTable);
    $cleanTable = preg_replace('|TITLE="(.*)"|', '', $cleanTable);
}

Above you can find my final solution to my problem. Below the old solution out of which I made a function for universal use.

Old solution:

// Try to find the table and remember its starting position. Check for succes.
// No success means the user is not logged in.
$gotTableStart = strpos($source, '<table class="schedule"');
if (!$gotTableStart)
{
    $err = 'Can\'t find the table start';
}
else
{

//      ***************************** 
    // This is the hard part: finding the closing tag.
    $foundIt          = false;
    $locationInScript = $gotTableStart;
    $tableStart       = $gotTableStart;

    while ($foundIt == false)
    {
        $innerTablePos = strpos($source, '<table', $locationInScript + 6);
        $tableEnd      = strpos($source, '</table>', $locationInScript + 7);

        // If it can find '<table' before '</table>' skip that closing tag.
        if ($innerTablePos != false && $innerTablePos < $tableEnd)
        {               
            $locationInScript = $tableEnd;
        }
        else
        {
            $foundIt  = true;
            $tableEnd = $tableEnd + 8;
        }
    }

//      ***************************** 

    // Clear the table from links and popups...
    $rawTable   = substr($tableData, $tableStart, ($tableEnd - $tableStart));

}

来源：https://stackoverflow.com/questions/8950582/how-to-get-everything-between-two-html-tags-with-xpath

标签

php

xpath

screen-scraping