I'm trying to scrape META keywords and description tags from arbitrary websites. I obviously have no control over said websites, so I have to take what I'm given. They use a variety of casings for the tag and its attributes, which means I need to work case-insensitively. I can't believe that the lxml authors are so stubborn as to insist on fully enforced standards compliance when it rules out so many uses of their library.
I'd like to be able to say doc.cssselect('meta[name=description]') (or some XPath equivalent), but this will not catch <meta name="Description" Content="..."> tags due to the capital D.
I'm currently using this as a workaround, but it's horrible!
for meta in doc.cssselect('meta'):
    name = meta.get('name')
    content = meta.get('content')
    if name and content:
        if name.lower() == 'keywords':
            keywords = content
        if name.lower() == 'description':
            description = content
It seems that the tag name meta is treated case-insensitively, but the attributes are not. It would be even more annoying if meta were case-sensitive too!
Values of attributes must be case-sensitive.
You can use an arbitrary regular expression to select an element:
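#!/usr/bin/env python
from lxml import html

doc = html.fromstring('''
    <meta name="Description">
    <meta name="description">
    <META name="description">
    <meta NAME="description">
''')

# re:test() is an EXSLT extension; the "i" flag makes the match case-insensitive.
for meta in doc.xpath('//meta[re:test(@name, "^description$", "i")]',
                      namespaces={"re": "http://exslt.org/regular-expressions"}):
    # encoding="unicode" makes tostring() return str instead of bytes on Python 3.
    print(html.tostring(meta, pretty_print=True, encoding="unicode"), end="")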
Output:
<meta name="Description">
<meta name="description">
<meta name="description">
<meta name="description">
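The same regular-expression test can be wrapped in a small function that returns the content value, which is what the question ultimately needs. This is only a sketch: meta_content is a hypothetical helper name, the sample markup is made up, and it assumes the name being searched for is a plain word with no regex metacharacters. The pattern is passed in as an XPath variable rather than pasted into the expression.

```python
from lxml import html

# EXSLT regular-expression namespace, as used in the XPath above.
RE_NS = {"re": "http://exslt.org/regular-expressions"}


def meta_content(doc, name):
    """Return the content of the first <meta> whose name matches, ignoring case.

    Hypothetical helper: `name` is assumed to be a plain word such as
    "keywords", with no regex metacharacters in it.
    """
    pattern = "^%s$" % name
    # $pattern is an XPath variable, filled in from the keyword argument.
    matches = doc.xpath(
        '//meta[re:test(@name, $pattern, "i")]/@content',
        namespaces=RE_NS,
        pattern=pattern,
    )
    return matches[0] if matches else None


# Made-up sample page with mixed casing in tags, attributes and values.
doc = html.fromstring('''
<html><head>
<meta name="Description" Content="A sample page">
<META NAME="KEYWORDS" content="lxml, scraping">
</head></html>
''')

description = meta_content(doc, "description")
keywords = meta_content(doc, "keywords")
```

The HTML parser has already lowercased the attribute names, so only the value of name needs the case-insensitive test.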
lxml is an XML parser. XML is case-sensitive. You are parsing HTML, so you should use an HTML parser. BeautifulSoup is very popular. Its only drawback is that it can be slow.
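For comparison, here is a rough sketch of how the same lookup might look with BeautifulSoup. It assumes the bs4 package is installed and uses Python's built-in "html.parser" backend; bs_meta_content and the sample markup are made up for illustration. Attribute values are still compared case-sensitively by default, so a compiled regular expression with re.I does the matching.

```python
import re

from bs4 import BeautifulSoup

# Made-up sample page with mixed casing in tags, attributes and values.
page = '''
<html><head>
<meta name="Description" Content="A sample page">
<META NAME="KEYWORDS" content="lxml, scraping">
</head></html>
'''

soup = BeautifulSoup(page, "html.parser")


def bs_meta_content(soup, name):
    """Hypothetical helper: content of the first <meta> whose name matches, ignoring case."""
    # re.escape() keeps any special characters in `name` literal.
    wanted = re.compile("^%s$" % re.escape(name), re.I)
    tag = soup.find("meta", attrs={"name": wanted})
    return tag.get("content") if tag is not None else None


bs_description = bs_meta_content(soup, "description")
bs_keywords = bs_meta_content(soup, "keywords")
```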
You can use
doc.xpath("//meta[translate(@name,
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')='description']")
It translates the value of "name" to lowercase and then matches.
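A runnable sketch of the translate() approach follows. The name meta_by_name and the sample markup are made up for illustration; the alphabets come from the string module so they cannot be mistyped. Note that translate() only maps the characters listed, so this handles ASCII letters only.

```python
import string

from lxml import html

# Made-up sample page with mixed casing in the attribute values.
doc = html.fromstring('''
<html><head>
<meta name="Description" content="A sample page">
<meta name="KEYWORDS" content="lxml, scraping">
</head></html>
''')

# XPath 1.0 has no lower-case() function, so translate() maps each ASCII
# capital letter to its lowercase counterpart.
LOWER_NAME = "translate(@name, '%s', '%s')" % (
    string.ascii_uppercase,
    string.ascii_lowercase,
)


def meta_by_name(doc, name):
    """Hypothetical helper: all <meta> elements whose name equals `name`, ignoring ASCII case."""
    # $name is an XPath variable, so the value needs no quoting inside the expression.
    return doc.xpath("//meta[%s=$name]" % LOWER_NAME, name=name.lower())


descriptions = [m.get("content") for m in meta_by_name(doc, "description")]
keyword_values = [m.get("content") for m in meta_by_name(doc, "Keywords")]
```

Unlike the regular-expression version, this relies only on core XPath 1.0 functions, so it does not need the EXSLT namespace.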
Source: https://stackoverflow.com/questions/1734125/is-it-possible-for-lxml-to-work-in-a-case-insensitive-manner