I am fetching data from a web page using urllib2. All of the pages are in English, so there is no issue of dealing with non-English text.
Or with Requests:
page_text = requests.get(url).text
lowercase_text = page_text.lower()
(Requests decodes the response body automatically, using the charset declared in the HTTP headers.)
As @tchrist says, .lower()
will not do the job for unicode text.
You could check out this alternative regex implementation, which supports full Unicode case folding for case-insensitive comparison: http://code.google.com/p/mrab-regex-hg/
There are also casefolding tables available: http://unicode.org/Public/UNIDATA/CaseFolding.txt
BeautifulSoup stores data as Unicode internally so you don't need to perform character encoding manipulations manually.
To find keywords case-insensitively in the text (but not in attribute values or tag names):
#!/usr/bin/env python
import urllib2
from contextlib import closing

import regex  # pip install regex
from BeautifulSoup import BeautifulSoup

URL = 'http://example.com'  # placeholder: the page to search

with closing(urllib2.urlopen(URL)) as page:
    soup = BeautifulSoup(page)

print soup(text=regex.compile(ur'(?fi)\L<keywords>',
                              keywords=['your', 'keywords', 'go', 'here']))
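If you can't install the third-party regex module, a stdlib-only sketch (Python 3 here) is to join the keywords into one alternation and compile it with re.IGNORECASE. Note that this does simple case mapping only; unlike regex's (?f) flag it will not match 'poſt' or treat 'ß' as 'ss':

```python
import re

# Build one case-insensitive pattern from the keyword list.
# re.escape protects any keyword containing regex metacharacters.
keywords = ['your', 'keywords', 'go', 'here']
pattern = re.compile('|'.join(map(re.escape, keywords)), re.IGNORECASE)

print(bool(pattern.search('KEYWORDS go HERE')))  # True, regardless of case
print(bool(pattern.search('something else')))    # False
```

This pattern can be passed to soup(text=pattern) exactly like the regex.compile object above.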
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex
from BeautifulSoup import BeautifulSoup, Comment
html = u'''<div attr="PoSt in attribute should not be found">
<!-- it must not find post inside a comment either -->
<ol> <li> tag names must not match
<li> Post will be found
<li> the same with post
<li> and post
<li> and poſt
<li> this is ignored
</ol>
</div>'''
soup = BeautifulSoup(html)
# remove comments
comments = soup.findAll(text=lambda t: isinstance(t, Comment))
for comment in comments:
    comment.extract()

# find text with keywords (case-insensitive, with full case folding)
print ''.join(soup(text=regex.compile(ur'(?fi)\L<opts>', opts=['post', 'li'])))

# compare it with '.lower()'
print '.lower():'
print ''.join(soup(text=lambda t: any(k in t.lower() for k in ['post', 'li'])))

# or exact match
print 'exact match:'
print ''.join(soup(text=u' the same with post\n'))

Output:
Post will be found
the same with post
and post
and poſt
.lower():
Post will be found
the same with post
exact match:
the same with post
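The 'poſt' line shows why the .lower() variant misses matches: 'ſ' (long s, U+017F) has no lowercase mapping, so .lower() leaves it unchanged, while full casefolding maps it to a plain 's'. A small Python 3 check (str.casefold() exists only in Python 3):

```python
# .lower() leaves the long s (U+017F) alone, so a lowercased substring
# search misses 'poſt'; .casefold() folds it to a plain ASCII 's'.
s = 'and po\u017ft'  # 'and poſt'

print('post' in s.lower())     # False
print('post' in s.casefold())  # True
```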
Case-insensitive string search is more complicated than simply searching in the lower-cased variant. For example, a German user would expect the search term Straße
to match both STRASSE
and Straße
, but 'STRASSE'.lower() == 'strasse'
(and you can't simply replace every double s with ß: the German word Trasse, for instance, contains no ß). Other languages (Turkish in particular, with its dotted and dotless i) have similar complications.
If you want to support languages other than English, you should therefore use a library that performs proper casefolding (such as Matthew Barnett's regex module).
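If you only need containment tests rather than regular expressions, Python 3's built-in str.casefold() performs the full casefolding described above. A minimal sketch (the helper name is my own, not from any library):

```python
def contains_ci(haystack, needle):
    """Case-insensitive containment via full Unicode casefolding."""
    return needle.casefold() in haystack.casefold()

print(contains_ci('STRASSE', 'Straße'))  # True: casefold maps 'ß' to 'ss'
print('STRASSE'.lower() == 'Straße'.lower())  # False: .lower() keeps 'ß'
```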
That being said, the way to extract the page's content is:
import contextlib
import urllib2

def get_page_content(url):
    with contextlib.closing(urllib2.urlopen(url)) as uh:
        content = uh.read().decode('utf-8')  # assumes the page is UTF-8
    return content

# You can call .lower() on the result, but that won't work in general
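Hard-coding 'utf-8' can also be wrong: servers usually declare the charset in the Content-Type header. A Python 3 sketch that reads it from the headers and falls back to UTF-8 (the function name is illustrative; response.headers in urllib.request is an email.message.Message, so it can be exercised offline):

```python
from email.message import Message

def pick_charset(headers, default='utf-8'):
    """Return the charset from a Content-Type header, or a default."""
    return headers.get_content_charset() or default

# In real use (network access assumed):
#   from urllib.request import urlopen
#   with urlopen(url) as resp:
#       text = resp.read().decode(pick_charset(resp.headers))

# Offline demonstration:
m = Message()
m['Content-Type'] = 'text/html; charset=ISO-8859-1'
print(pick_charset(m))          # iso-8859-1
print(pick_charset(Message()))  # utf-8 (fallback)
```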