Question:
I'm running etree.HTML(data) as below on lots of different data contents. With one specific data content, however, lxml.etree.HTML will not parse it; it hangs in what looks like an infinite loop and consumes 100% CPU. Does anyone know exactly what in the data below could be causing this? And more importantly, how can I guard against this happening on an endless stream of random, broken data?
Edit: This turns out to be a bug in libxml2 versions 2.7.8 and below (at least). After updating to libxml2 2.9.0, the bug is gone.
Edit: I know the code below constitutes an infinite loop, but that is not the bad behaviour I'm seeing. With healthy data content it runs fine (as an intentional infinite loop). With unhealthy data content, like below, the loop STOPS, RAM starts filling up, and once it's full all CPU goes into WAIT state. See this question for the original debugging.
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
import sys
from lxml import etree
data = '''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<meta charset="UTF-8">
<title>The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked -- Grub Street New York</title>
<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="http://feedproxy.google.com/nymag/grubstreet" />
<meta name="Headline" content="The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked" />
<meta name="keywords" content="april bloomfield, el gordo, frank bruni, gordon ramsay, lawsuits, lists, marcus samuelsson, mario batali, shitlist, spotted pig, sued" />
<meta name="description" content="Racism, fat-shaming, and vegetarian trickery." />
<meta name="Byline" content="Sierra Tishgart" />
<meta name="Type_of_Feature" content="" />
<meta name="Issue_Date" content="March 8, 2013 12:50 PM" />
<meta name="related_stories" content="The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked" />
<meta name="document_type" content="Blog" />
<meta name="category" content="Lists" />
<link rel="image_src" href="http://pixel.nymag.com/imgs/daily/grub/2013/03/08/08-gorgon-ramsay.o.jpg/a_146x97.jpg" />
<link rel="canonical" href="http://newyork.grubstreet.com/2013/03/20-despicable-things-gordon-ramsay.html" id="canonical" />
<script>
var canonicalUrl = "http://newyork.grubstreet.com/2013/03/20-despicable-things-gordon-ramsay.html";
</script>
<meta name="content.tags.primary" content=";network - Grub Street,;city - New York City,;tag - lists" />
<meta name="content.tags" content=";tag - april bloomfield,;tag - el gordo,;tag - frank bruni,;tag - gordon ramsay,;tag - lawsuits,;tag - marcus samuelsson,;tag - mario batali,;tag - shitlist,;tag - spotted pig,;tag - sued" />
<meta name="content.hierarchy" content="New York City:Grub Street" />
<meta name="content.type" content="Blog" />
<meta name="content.subtype" content="Blog Entry" />
<meta property="fb:app_id" content="206283005644" />
<meta property="og:title" content="The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked" />
<meta property="og:description" content="Racism, fat-shaming, and vegetarian trickery." />
<meta property="og:image" content="http://pixel.nymag.com/imgs/daily/grub/2013/03/08/08-gorgon-ramsay.o.jpg/a_146x97.jpg"/>
<meta property="og:url" content="http://newyork.grubstreet.com/2013/03/20-despicable-things-gordon-ramsay.html" />
<meta property="og:type" content="article" />
<meta property="og:site_name" content="Grub Street New York" />
<meta name="viewport" content="width=1020">
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/grubstreet/grubstreet-core.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/section/daily/slideshow.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/echo.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/loginRegister.css" media="all" />
<link rel="stylesheet" href="http://cache.nymag.com/css/screen/advertising.css" media="all" />
<link rel="shortcut icon" href="http://images.nymag.com/gfx/grubst/favicon.ico" />
<style type="text/css">
#adsplashtop,#pushdown {padding:5px 5px;}
#pushdown {border-top:1px solid #737373}
</style>
<!--[if IE 6]>
<link rel="stylesheet" href="http://cache.nymag.com/css/screen/grubstreet/win-ie6.css" type="text/css" media="screen, projection" />
<![endif]-->
<!--[if IE 7]>
<link rel="stylesheet" href="http://cache.nymag.com/css/screen/grubstreet/win-ie7.css" type="text/css" media="screen, projection" />
<![endif]-->
<script type="text/javascript">
var NYM = {};
NYM.config = {};
NYM.config.membership = {
"service":"nym"
};
NYM.config.advertising = {
"sitename":"nym.grubstreet"
};
</script>
<script type="text/javascript">
var date = 'March 12, 2013 12:42:38';
var currDate=new Date(date);
var GRUBST = {};
if (!NYM) {
var NYM = {};
NYM.config = {};
NYM.config.membership = {
"service":"nym"
};
NYM.config.advertising = {
"sitename":"nym.grubstreet"
};
}
</script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/modernizr-1.7.min.js"></script>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/jquery-ui-1.8.2.custom.min.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/ad_manager.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/js/2/global.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/skinTakeover.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/grubstreet-controls.js"></scr
'''
n = 0
while True:
    n += 1
    tree = etree.HTML(data)
    m = tree.xpath("//meta[@property]")
    print '-', n
    for i in m:
        print n
        #print (i.attrib['property'], i.attrib['content'])
To quickly print the relevant version numbers, you can use:
import sys
from lxml import etree
print("%-20s: %s" % ('Python', sys.version_info))
print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
I've got:
OS : Ubuntu 12.10 (AWS)
Python : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (3, 1, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
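Since the fix here was a version upgrade, one cheap safeguard (a minimal sketch; libxml_at_least is a hypothetical helper, and the 2.9.0 threshold is taken from the edit above) is to check the libxml2 version at startup and bail out early instead of risking a hang:

```python
from lxml import etree

def libxml_at_least(minimum=(2, 9, 0)):
    # etree.LIBXML_VERSION is a tuple such as (2, 7, 8); Python compares
    # tuples element-wise, so this is a plain version comparison.
    return tuple(etree.LIBXML_VERSION) >= tuple(minimum)

# Fail fast rather than feed broken input to a libxml2 known to hang:
if not libxml_at_least():
    raise RuntimeError("libxml2 >= 2.9.0 required, got %r"
                       % (etree.LIBXML_VERSION,))
```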
Answer 1:
Here is a way to parse partial HTML using lxml's incremental feed parser. It seems to work around the hang, which appears to occur with libxml versions (2, 7, 8) and older:
parser = LH.HTMLParser()
parser.feed(data)
root = parser.close()
m = root.xpath('//meta[@property]')
import sys
import lxml.html as LH
import lxml.etree as ET
data = '''
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="ie6"> <![endif]-->
<!--[if IE 7]> <html class="ie7"> <![endif]-->
<!--[if IE 8]> <html class="ie8"> <![endif]-->
<!--[if gt IE 8]><!--> <html> <!--<![endif]-->
<head profile="http://gmpg.org/xfn/11">
<meta charset="UTF-8">
<title>
Erased US data shows 1 in 4 missiles in Afghan airstrikes now fired by drone: The Bureau of Investigative Journalism </title>
<meta name="description" content="Drone data has been wiped from the Air Force website.">
<meta name="generator" content="Magicalia 2010" />
<meta name="google-site-verification" content="bGFVI6kAZGjMNNiS6LGvBDWSGydwyWQI3gogCD4xP50" />
<link href="http://cdn-images.mailchimp.com/embedcode/slim-081711.css" rel="stylesheet" type="text/css">
<link rel="stylesheet" href="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/css/screen.css" type="text/css" media="screen, projection" />
<link rel="stylesheet" href="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/css/print.css" type="text/css" media="print" />
<link rel="stylesheet" href="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/style.css?3" type="text/css" media="screen, projection" />
<!--[if IE]>
<link rel="stylesheet" href="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/css/lib/ie.css" type="text/css" media="screen, projection" />
<![endif]-->
<!--[if lt IE 7]>
<script defer type="text/javascript" src="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/pngfix.js"></script>
<![endif]-->
<!--[if gte IE 5.5]>
<script language="javaScript" src="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/dhtml.js" type="text/javaScript"></script>
<![endif]-->
<link rel="alternate" type="application/rss+xml" title="The Bureau of Investigative Journalism RSS Feed" href="http://www.thebureauinvestigates.com/feed/" />
<link rel="pingback" href="http://www.thebureauinvestigates.com/xmlrpc.php" />
<link rel="alternate" type="application/rss+xml" title="The Bureau of Investigative Journalism » Erased US data shows 1 in 4 missiles in Afghan airstrikes now fired by drone Comments Feed" href="http://www.thebureauinvestigates.com/2013/03/12/erased-us-data-shows-1-in-4-missiles-in-afghan-airstrikes-now-fired-by-drone/feed/" />
<link rel='stylesheet' id='mailchimp-css' href='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/lib/mailchimp.dev.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='donate-css' href='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/lib/donate.dev.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='tubepress-css' href='http://www.thebureauinvestigates.com/wp-content/plugins/tubepress/src/main/web/css/tubepress.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='NextGEN-css' href='http://www.thebureauinvestigates.com/wp-content/plugins/nextgen-gallery/css/nggallery.css?ver=1.0.0' type='text/css' media='screen' />
<link rel='stylesheet' id='shutter-css' href='http://www.thebureauinvestigates.com/wp-content/plugins/nextgen-gallery/shutter/shutter-reloaded.css?ver=1.3.4' type='text/css' media='screen' />
<link rel='stylesheet' id='stbCSS-css' href='http://www.thebureauinvestigates.com/wp-content/plugins/wp-special-textboxes/css/wp-special-textboxes.css.php?ver=4.3.72' type='text/css' media='all' />
<link rel='stylesheet' id='grid-css' href='http://www.thebureauinvestigates.com/wp-content/plugins/big-brother/css/grid.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='reveal-css' href='http://www.thebureauinvestigates.com/wp-content/plugins/big-brother/css/reveal.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='app-css' href='http://www.thebureauinvestigates.com/wp-content/plugins/big-brother/css/app.css?ver=3.5.1' type='text/css' media='all' />
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-includes/js/jquery/jquery.js?ver=1.8.3'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/plugins/tubepress/src/main/web/js/tubepress.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/jquery.cycle.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/search.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/nav/superfish.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/nav/supersubs.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/home.js?ver=3.5.1'></sc
'''
if __name__ == '__main__':
    print("%-20s: %s" % ('Python', sys.version_info))
    print("%-20s: %s" % ('lxml.etree', ET.LXML_VERSION))
    print("%-20s: %s" % ('libxml used', ET.LIBXML_VERSION))
    print("%-20s: %s" % ('libxml compiled', ET.LIBXML_COMPILED_VERSION))
    print("%-20s: %s" % ('libxslt used', ET.LIBXSLT_VERSION))
    print("%-20s: %s" % ('libxslt compiled', ET.LIBXSLT_COMPILED_VERSION))
    n = 0
    while True:
        n += 1
        print '-', n
        parser = LH.HTMLParser()
        parser.feed(data)
        root = parser.close()
        m = root.xpath('//meta[@property]')
        for i in m:
            print(n)
yields
% test.py
Python : sys.version_info(major=2, minor=7, micro=2, releaselevel='final', serial=0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
- 1
- 2
- 3
- 4
- 5
...
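A more general defence against arbitrary broken input (a sketch, not from the answers above; parse_with_timeout is a hypothetical helper) is to run the parse in a worker process with a timeout, so a hung libxml2 call can be killed instead of wedging the whole program:

```python
import multiprocessing

from lxml import etree

def _worker(html, queue):
    tree = etree.HTML(html)
    # Element objects cannot be pickled, so ship serialized bytes back.
    queue.put(etree.tostring(tree))

def parse_with_timeout(html, seconds=10):
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_worker, args=(html, queue))
    proc.start()
    proc.join(seconds)
    if proc.is_alive():
        proc.terminate()  # parser hung: kill the worker and give up
        proc.join()
        return None
    return etree.HTML(queue.get(timeout=1))

if __name__ == '__main__':
    tree = parse_with_timeout(
        '<html><head><meta property="og:title" content="x"/></head></html>')
    print(tree is not None)
```

Spawning a process per parse is not free; for high volume you would pool workers, but the kill-on-timeout idea is the same.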
Answer 2:
This has nothing to do with lxml.html. Check:
tree = lxml.html.fromstring( data )
print tree
# <Element html at 0x1bb5530>
print tree.xpath("//meta[@property]")
# []
Instead, look at this part, where you effectively have an infinite loop:
n = 0
while True:
    n += 1
    m = []  # never mind whether you get results or not - it looks like you don't
    for i in m:
        print n
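If the loop were only meant to run while matches exist, an explicit guard (a minimal sketch) makes the empty-result case visible instead of silently spinning:

```python
from lxml import etree

html = '<html><head><meta property="og:title" content="Hello"/></head></html>'
tree = etree.HTML(html)
matches = tree.xpath('//meta[@property]')
if not matches:
    raise ValueError('no <meta property=...> elements found')
for el in matches:
    print(el.get('property'), el.get('content'))  # og:title Hello
```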
Source: https://stackoverflow.com/questions/15367001/how-to-prevent-lxml-etree-html-data-from-crashing-on-certain-type-of-data