Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?

面向向阳花 2020-11-22 08:30

This is meant to provide a canonical Q&A for all the similar questions (each much too specific to be a close-target candidate) that pop up once or twice a week.

2 Answers
  •  滥情空心
    2020-11-22 09:32

    Just came across the same problem. I almost wrote a recursive function to check whether each tbody tag exists and traverse the DOM that way, then I remembered I know regex. :)

    Before parsing, get the HTML as a string. Insert the missing `<tbody>` and `</tbody>` tags with regex, then load it back into your DOMDocument object.

    Jens Erat gives a good explanation, but here is

    Solution 4: Make sure the HTML source always has the `<tbody>` tags, using regex

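    JavaScript

        var html = '<table><tr><td>foo</td><td>bar</td></tr></table>';
        // replace() returns a new string, so assign the result back
        html = html.replace(/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/g, "$1<tbody>")
                   .replace(/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/g, "$1</tbody>$4");

    PHP

        // Serialize the document, patch the markup, then reload it
        $html = $dom->saveHTML();
        $html = preg_replace(
            array('/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/', '/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/'),
            array('$1<tbody>', '$1</tbody>$4'),
            $html
        );
        $dom->loadHTML($html);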

    Just the regex:

    This matches a `<table>` tag, along with whatever attributes are inside the tag and any text between it and the next tag, if the next tag is NOT `<tbody>` (which may also have attributes inside the tag):

        /(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/

    replace with

        $1<tbody>

    where $1 references the captured `<table>` tag with its contents. Do the same for the closing tag like this:

        /(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/

    replace with

        $1</tbody>$4
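To see which parts of the markup `$1` and `$4` pick up in the closing-tag pattern, a quick check with `match()` can help. This is only an illustration: the sample string is made up, and the pattern is used without the `g` flag so that `match()` returns the capture groups.

```javascript
// Made-up sample markup: a table without <tbody>.
const sample = '<table><tr><td>foo</td></tr></table>';

// Closing-tag pattern from the answer, without the g flag.
const closing = /(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/;

const m = sample.match(closing);
console.log(m[1]); // the tag directly before </table>
console.log(m[4]); // the </table> tag itself

// Rebuilding the match as $1 + </tbody> + $4 closes the body
// right before the table ends.
const patched = sample.replace(closing, '$1</tbody>$4');
console.log(patched);
```

Groups 2, 3 and 5 are just the inner pieces of the lookahead and the attribute matches, which is why the replacement skips from `$1` to `$4`.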

    This way the DOM will ALWAYS have the `<tbody>` tags where necessary.
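One thing worth checking before relying on this: as far as I can tell, the opening-tag pattern can backtrack over whitespace, so pretty-printed markup where a newline sits between `<table>` and an existing `<tbody>` ends up with a second `<tbody>`. The sketch below is a variation of mine, not the answer's exact patterns: `ensureTbody` is a made-up helper name, and both patterns are rewritten to allow whitespace between tags. It still assumes simple tables, where the rows are not preceded by `<thead>`, `<caption>` or `<colgroup>` and not followed by `<tfoot>`.

```javascript
// Hypothetical helper (not from the original answer): wraps the two
// replacements and tolerates whitespace between tags.
function ensureTbody(html) {
  return html
    // Add <tbody> after <table ...> unless a <tbody> follows
    // (optionally after whitespace).
    .replace(/(<table\b[^>]*>)(?!\s*<tbody\b)/gi, '$1<tbody>')
    // Add </tbody> before </table> unless the tag just before it
    // is already </tbody>.
    .replace(/(<(?!\/tbody\b)[^>]*>)(\s*<\/table\s*>)/gi, '$1</tbody>$2');
}

// Made-up samples: one compact table without <tbody>,
// one pretty-printed table that already has it.
const compact = '<table><tr><td>foo</td><td>bar</td></tr></table>';
const pretty = '<table>\n<tbody>\n<tr><td>foo</td></tr>\n</tbody>\n</table>';

console.log(ensureTbody(compact)); // rows now wrapped in <tbody>...</tbody>
console.log(ensureTbody(pretty) === pretty); // already complete, left unchanged
```

For anything beyond simple tables, a real HTML parser is probably the safer route, since regex cannot see nesting.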
