I am fetching data from a web page using urllib2. All of the pages are in English, so there is no issue of dealing with non-English text.
Or with Requests:
page_text = requests.get(url).text
lowercase_text = page_text.lower()
(Requests decodes the response body automatically, using the charset declared in the HTTP headers.)
As @tchrist says, .lower()
will not do the job for unicode text.
You could check out this alternative regex implementation, which supports full Unicode case folding for case-insensitive comparison: http://code.google.com/p/mrab-regex-hg/
There are also casefolding tables available: http://unicode.org/Public/UNIDATA/CaseFolding.txt
BeautifulSoup stores data as Unicode internally so you don't need to perform character encoding manipulations manually.
To find keywords case-insensitively in the text (but not in attribute values or tag names):
#!/usr/bin/env python
import urllib2
from contextlib import closing

import regex  # pip install regex
from BeautifulSoup import BeautifulSoup

URL = 'http://example.com'  # placeholder: the page to search

with closing(urllib2.urlopen(URL)) as page:
    soup = BeautifulSoup(page)

print soup(text=regex.compile(ur'(?fi)\L<keywords>',
                              keywords=['your', 'keywords', 'go', 'here']))
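If you can't install the third-party regex module, a stdlib-only sketch (Python 3 here) is to join the keywords into one alternation and compile it with re.IGNORECASE. Note that this does simple case mapping only; unlike regex's (?f) flag it will not match 'poſt' or treat 'ß' as 'ss':

```python
import re

# Build one case-insensitive pattern from the keyword list.
# re.escape protects any keyword containing regex metacharacters.
keywords = ['your', 'keywords', 'go', 'here']
pattern = re.compile('|'.join(map(re.escape, keywords)), re.IGNORECASE)

print(bool(pattern.search('KEYWORDS go HERE')))  # True, regardless of case
print(bool(pattern.search('something else')))    # False
```

This pattern can be passed to soup(text=pattern) exactly like the regex.compile object above.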
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex
from BeautifulSoup import BeautifulSoup, Comment
html = u'''<div attr="PoSt in attribute should not be found">
<!-- it must not find post inside a comment either -->
<ol> <li> tag names must not match
<li> Post will be found
<li> the same with post
<li> and post
<li> and poſt
<li> this is ignored
</ol>
</div>'''
soup = BeautifulSoup(html)
# remove comments
comments = soup.findAll(text=lambda t: isinstance(t, Comment))
for comment in comments:
    comment.extract()

# find text with keywords (case-insensitive, with full case folding)
print ''.join(soup(text=regex.compile(ur'(?fi)\L<opts>', opts=['post', 'li'])))

# compare it with '.lower()'
print '.lower():'
print ''.join(soup(text=lambda t: any(k in t.lower() for k in ['post', 'li'])))

# or exact match
print 'exact match:'
print ''.join(soup(text=u' the same with post\n'))

Output:
Post will be found
the same with post
and post
and poſt
.lower():
Post will be found
the same with post
exact match:
the same with post
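The 'poſt' line shows why the .lower() variant misses matches: 'ſ' (long s, U+017F) has no lowercase mapping, so .lower() leaves it unchanged, while full casefolding maps it to a plain 's'. A small Python 3 check (str.casefold() exists only in Python 3):

```python
# .lower() leaves the long s (U+017F) alone, so a lowercased substring
# search misses 'poſt'; .casefold() folds it to a plain ASCII 's'.
s = 'and po\u017ft'  # 'and poſt'

print('post' in s.lower())     # False
print('post' in s.casefold())  # True
```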
Case-insensitive string search is more complicated than simply searching in the lower-cased variant. For example, a German user would expect the search term Straße
to match both STRASSE
and Straße
, but 'STRASSE'.lower() == 'strasse'
(and you can't simply replace every double s with ß: the German word Trasse, for instance, contains no ß). Other languages (Turkish in particular, with its dotted and dotless i) have similar complications.
If you want to support languages other than English, you should therefore use a library that performs proper casefolding (such as Matthew Barnett's regex module).
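If you only need containment tests rather than regular expressions, Python 3's built-in str.casefold() performs the full casefolding described above. A minimal sketch (the helper name is my own, not from any library):

```python
def contains_ci(haystack, needle):
    """Case-insensitive containment via full Unicode casefolding."""
    return needle.casefold() in haystack.casefold()

print(contains_ci('STRASSE', 'Straße'))  # True: casefold maps 'ß' to 'ss'
print('STRASSE'.lower() == 'Straße'.lower())  # False: .lower() keeps 'ß'
```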
That being said, the way to extract the page's content is:
import contextlib
import urllib2

def get_page_content(url):
    with contextlib.closing(urllib2.urlopen(url)) as uh:
        content = uh.read().decode('utf-8')  # assumes the page is UTF-8
    return content

# You can call .lower() on the result, but that won't work in general
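Hard-coding 'utf-8' can also be wrong: servers usually declare the charset in the Content-Type header. A Python 3 sketch that reads it from the headers and falls back to UTF-8 (the function name is illustrative; response.headers in urllib.request is an email.message.Message, so it can be exercised offline):

```python
from email.message import Message

def pick_charset(headers, default='utf-8'):
    """Return the charset from a Content-Type header, or a default."""
    return headers.get_content_charset() or default

# In real use (network access assumed):
#   from urllib.request import urlopen
#   with urlopen(url) as resp:
#       text = resp.read().decode(pick_charset(resp.headers))

# Offline demonstration:
m = Message()
m['Content-Type'] = 'text/html; charset=ISO-8859-1'
print(pick_charset(m))          # iso-8859-1
print(pick_charset(Message()))  # utf-8 (fallback)
```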