I am fetching data from a web page using urllib2. The content of all the pages is in the English language so there is no issue of dealing with non-English text. The pages ar
Case-insensitive string search is more complicated than simply searching in the lower-cased variant. For example, a German user would expect the search term Straße to match both STRASSE and Straße, but 'STRASSE'.lower() == 'strasse' (and you can't simply replace a double s with ß: there's no ß in Trasse). Other languages (Turkish in particular) have similar complications.
If you want to support languages other than English, you should therefore use a library that handles proper case folding (such as Matthew Barnett's regex module).
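As a rough illustration of the difference between lower-casing and case folding, here is a minimal sketch. It assumes Python 3.3 or later, where the built-in str.casefold() is available; this is a stand-in for a dedicated library, not the approach the code in this answer (Python 2, urllib2) uses. The helper name contains_caseless is made up for this example.

```python
# Assumption: Python 3.3+, which provides str.casefold().
# contains_caseless is a hypothetical helper name used only for illustration.

def contains_caseless(haystack, needle):
    """Return True if needle occurs in haystack, comparing case-folded text."""
    return needle.casefold() in haystack.casefold()


# Lower-casing keeps the ß, so the plain .lower() comparison misses the match.
print("Straße".lower() in "STRASSE".lower())

# Case folding maps ß to ss, so both spellings compare equal.
print(contains_caseless("STRASSE", "Straße"))
```

Note that casefold() is locale-independent, so it probably won't give Turkish users the dotted/dotless i behaviour they expect; that still needs language-aware handling.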
That being said, the way to extract the page's content is:
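import contextlib
import urllib2

def get_page_content(url):
    # closing() makes sure the connection is closed after reading
    with contextlib.closing(urllib2.urlopen(url)) as uh:
        # Assumes the page is encoded as UTF-8
        content = uh.read().decode('utf-8')
    return content

# You can call .lower() on the result, but that won't work in general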