Is it possible for lxml to work in a case-insensitive manner?


I'm trying to scrape META keywords and description tags from arbitrary websites. I obviously have no control over said websites, so I have to take what I'm given. They have a


3 Answers

孤独总比滥情好

2020-12-10 06:30

lxml is an XML parser. XML is case-sensitive. You are parsing HTML, so you should use an HTML parser. BeautifulSoup is very popular. Its only drawback is that it can be slow.
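To illustrate why an HTML parser sidesteps the problem, here is a minimal sketch that uses the standard library's `html.parser` instead of BeautifulSoup, so it runs without extra installs. `HTMLParser` lowercases tag and attribute names before handing them to the handler, so only the attribute value needs to be lowercased by hand. The names `MetaCollector`, `collect_meta`, and the sample markup are hypothetical, chosen for this example only.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the content of selected <meta> tags, ignoring case."""

    def __init__(self, wanted=("description", "keywords")):
        super().__init__()
        self.wanted = {w.lower() for w in wanted}
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive already lowercased.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # Attribute values keep their original case, so lowercase the name value.
        name = (attr_map.get("name") or "").lower()
        if name in self.wanted:
            # Keep the first occurrence if a page repeats a tag.
            self.found.setdefault(name, attr_map.get("content"))


def collect_meta(markup):
    parser = MetaCollector()
    parser.feed(markup)
    parser.close()
    return parser.found


# Hypothetical sample input with mixed-case tags, attributes, and values.
sample = (
    '<META NAME="Description" CONTENT="Demo page">'
    '<meta name="KEYWORDS" content="a, b">'
)
result = collect_meta(sample)
print(result)
```

The same idea should carry over to BeautifulSoup or `lxml.html`: let the parser normalize element and attribute names, then compare the attribute value in lowercase.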


迷失自我

2020-12-10 06:31

Attribute values are case-sensitive, so the comparison itself has to ignore case.

You can use an arbitrary regular expression to select an element (lxml supports the EXSLT regular-expressions extension in XPath):

#!/usr/bin/env python
from lxml import html

doc = html.fromstring('''
    <meta name="Description">
    <meta name="description">
    <META name="description">
    <meta NAME="description">
''')
for meta in doc.xpath('//meta[re:test(@name, "^description$", "i")]',
                      namespaces={"re": "http://exslt.org/regular-expressions"}):
    print(html.tostring(meta, pretty_print=True, encoding="unicode"), end="")

Output:

<meta name="Description">
<meta name="description">
<meta name="description">
<meta name="description">


渐次进展

2020-12-10 06:48
You can use
```
doc.xpath("//meta[translate(@name, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
          "'abcdefghijklmnopqrstuvwxyz')='description']")
```
It translates the value of "name" to lowercase and then matches.
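As a fuller sketch of the `translate()` approach, the snippet below wraps the expression in a small helper and passes the alphabets in as XPath variables, which lxml accepts as keyword arguments to `xpath()`. The helper name `meta_content` and the sample markup are assumptions made for illustration. Note that `translate()` only maps the characters it is given, so this handles ASCII letters only.

```python
import string

from lxml import html


def meta_content(doc, wanted):
    """Return the content of <meta> tags whose name equals `wanted`, ignoring ASCII case."""
    return doc.xpath(
        "//meta[translate(@name, $upper, $lower) = $wanted]/@content",
        upper=string.ascii_uppercase,
        lower=string.ascii_lowercase,
        wanted=wanted.lower(),
    )


# Hypothetical sample page with mixed-case tags, attributes, and values.
SAMPLE = """
<html><head>
<META NAME="Description" CONTENT="A sample page">
<meta name="KEYWORDS" content="lxml, xpath">
</head><body></body></html>
"""

doc = html.fromstring(SAMPLE)
print(meta_content(doc, "description"))
print(meta_content(doc, "keywords"))
```

Element and attribute names such as `META` and `NAME` are already lowercased by lxml's HTML parser, so only the attribute value needs the `translate()` treatment.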

See also:
- XPath: How do you do a lowercase call in xpath
- Xpath translation function turns things into lowercase?