Is it possible for lxml to work in a case-insensitive manner?

野的像风 2020-12-10 05:47

I'm trying to scrape META keywords and description tags from arbitrary websites. I obviously have no control over those websites, so I have to take what I'm given. They have a

3 answers
  • 2020-12-10 06:30

    lxml is an XML parser, and XML is case-sensitive. You are parsing HTML, so you should use an HTML parser. BeautifulSoup is very popular. Its only drawback is that it can be slow.
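As a rough illustration of the "use an HTML parser" advice, here is a minimal sketch that swaps in Python's standard-library `html.parser` rather than BeautifulSoup. `HTMLParser` already lowercases tag and attribute names, so only the attribute value needs normalising. The `MetaCollector` class and the `SAMPLE` markup are hypothetical names made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; casing varies on purpose.
SAMPLE = """
<html><head>
<META NAME="Description" CONTENT="An example page">
<meta name="KEYWORDS" content="lxml, xpath">
</head><body></body></html>
"""


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs, keyed by lowercased name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        if name is not None:
            # Lowercase the value ourselves; keep the first occurrence only.
            self.meta.setdefault(name.lower(), attr_map.get("content"))


collector = MetaCollector()
collector.feed(SAMPLE)
print(collector.meta.get("description"))
print(collector.meta.get("keywords"))
```

Lookups then go through the lowercased key, so `Description`, `DESCRIPTION`, and `description` all land in the same entry.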

  • 2020-12-10 06:31

    Values of attributes must be case-sensitive.

    You can use an arbitrary regular expression to select an element:

    #!/usr/bin/env python
    from lxml import html
    
    doc = html.fromstring('''
        <meta name="Description">
        <meta name="description">
        <META name="description">
        <meta NAME="description">
    ''')
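    # re:test() comes from the EXSLT regular-expressions extension; the "i" flag ignores case.
    for meta in doc.xpath('//meta[re:test(@name, "^description$", "i")]',
                          namespaces={"re": "http://exslt.org/regular-expressions"}):
        print(html.tostring(meta, pretty_print=True, encoding="unicode"), end="")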
    

    Output:

    <meta name="Description">
    <meta name="description">
    <meta name="description">
    <meta name="description">
    
  • 2020-12-10 06:48

    You can use

    doc.xpath("//meta[translate(@name, "
              "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')='description']")
    

    It translates the value of "name" to lowercase and then matches.
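A minimal, self-contained sketch of the `translate()` approach, assuming lxml is installed. It passes the alphabets and the target name as XPath variables instead of embedding them in the query string. The helper `meta_content` and the `SAMPLE` markup are hypothetical names introduced for illustration.

```python
from lxml import html

# Hypothetical sample page; tag, attribute, and value casing vary on purpose.
SAMPLE = """
<html><head>
<META NAME="Description" CONTENT="An example page">
<meta name="KEYWORDS" content="lxml, xpath">
</head><body></body></html>
"""

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"


def meta_content(doc, name):
    """Return the content of the first <meta> whose name matches, ignoring ASCII case."""
    # translate() maps each uppercase ASCII letter in @name to its lowercase form.
    query = "//meta[translate(@name, $upper, $lower) = $name]/@content"
    values = doc.xpath(query, upper=UPPER, lower=LOWER, name=name.lower())
    return values[0] if values else None


doc = html.fromstring(SAMPLE)
description = meta_content(doc, "description")
keywords = meta_content(doc, "keywords")
print(description)
print(keywords)
```

Note that `translate()` only folds the characters listed in the two alphabets, so this handles ASCII names and not accented or other non-ASCII letters.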

    See also:

    • XPath: How do you do a lowercase call in xpath
    • Xpath translation function turns things into lowercase?