Returning a lower case ASCII string from a (possibly encoded) string fetched using urllib2 or BeautifulSoup

Backend · open · 3 answers · 604 views
Asked by 甜味超标 on 2020-12-06 23:24

I am fetching data from a web page using urllib2. The content of all the pages is in the English language, so there is no issue of dealing with non-English text. The pages are …

3 Answers
  • 2020-12-06 23:46

    Or with Requests:

    page_text = requests.get(url).text
    lowercase_text = page_text.lower()
    

    (Requests decodes the response body automatically, using the encoding declared in the response headers, with a guess as fallback.)

    As @tchrist says, .lower() will not do the job for Unicode text.

    You could check out this alternative regex implementation, which implements case folding for Unicode case-insensitive comparison: http://code.google.com/p/mrab-regex-hg/

    There are also casefolding tables available: http://unicode.org/Public/UNIDATA/CaseFolding.txt
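    For comparison, Python 3's built-in str.casefold() applies the full case folding from that table, so it catches characters that .lower() leaves untouched. A minimal sketch (the answers here target Python 2, where str.casefold() does not exist):

    ```python
    # Python 3 only: str.casefold() applies Unicode full case folding
    # (the same mappings as in CaseFolding.txt), unlike str.lower().
    text = 'poſt'           # contains U+017F LATIN SMALL LETTER LONG S

    print(text.lower())     # 'poſt' — long s is already lowercase, .lower() keeps it
    print(text.casefold())  # 'post' — casefold maps it to a plain 's'
    ```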

  • 2020-12-06 23:53

    BeautifulSoup stores data as Unicode internally so you don't need to perform character encoding manipulations manually.

    To find keywords (case-insensitive) in a text (not in attribute values, or tag names):

    #!/usr/bin/env python
    import urllib2
    from contextlib import closing 
    
    import regex # pip install regex
    from BeautifulSoup import BeautifulSoup
    
    URL = 'http://example.com'  # placeholder; substitute the page you want to fetch
    with closing(urllib2.urlopen(URL)) as page:
         soup = BeautifulSoup(page)
         print soup(text=regex.compile(ur'(?fi)\L<keywords>',
                                       keywords=['your', 'keywords', 'go', 'here']))
    

    Example (Unicode words by @tchrist)

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import regex
    from BeautifulSoup import BeautifulSoup, Comment
    
    html = u'''<div attr="PoSt in attribute should not be found">
    <!-- it must not find post inside a comment either -->
    <ol> <li> tag names must not match
    <li> Post will be found
    <li> the same with post
    <li> and post
    <li> and poſt
    <li> this is ignored
    </ol>
    </div>'''
    
    soup = BeautifulSoup(html)
    
    # remove comments
    comments = soup.findAll(text=lambda t: isinstance(t, Comment))
    for comment in comments: comment.extract()
    
    # find text with keywords (case-insensitive)
    print ''.join(soup(text=regex.compile(ur'(?fi)\L<opts>', opts=['post', 'li'])))
    # compare it with '.lower()'
    print '.lower():'
    print ''.join(soup(text=lambda t: any(k in t.lower() for k in ['post', 'li'])))
    # or exact match
    print 'exact match:'
    print ''.join(soup(text=' the same with post\n'))
    

    Output

     Post will be found
     the same with post
     and post
     and poſt
    
    .lower():
     Post will be found
     the same with post
    
    exact match:
     the same with post
    
  • 2020-12-06 23:59

    Case-insensitive string search is more complicated than simply searching in the lower-cased variant. For example, a German user would expect the search term Straße to match both STRASSE and Straße, but 'STRASSE'.lower() == 'strasse' (and you can't simply replace a double s with ß - there's no ß in Trasse). Other languages (Turkish in particular) have similar complications.

    If you're looking to support languages other than English, you should therefore use a library that can handle proper casefolding (such as Matthew Barnett's regex module).
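    If you can use Python 3, the standard library already covers this case: str.casefold() (not available in Python 2) handles the Straße example correctly, while .lower() does not. A sketch:

    ```python
    # Unicode case folding maps 'ß' to 'ss', so both spellings compare equal.
    needle = 'Straße'

    print('STRASSE'.casefold() == needle.casefold())  # True: both fold to 'strasse'
    print('STRASSE'.lower() == needle.lower())        # False: 'strasse' != 'straße'
    ```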

    That being said, the way to extract the page's content is:

    import contextlib
    import urllib2

    def get_page_content(url):
      # Note: this assumes the page is UTF-8; check the Content-Type
      # header if you need to handle other encodings.
      with contextlib.closing(urllib2.urlopen(url)) as uh:
        content = uh.read().decode('utf-8')
      # You can call .lower() on the result, but that won't work in general.
      return content
    
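    Hard-coding 'utf-8' can misdecode pages served with a different charset. In Python 3, the headers attribute of a urllib.request response is an email.message.Message, which exposes the declared charset directly. The snippet below simulates that with a hand-built Message rather than a live request (the fallback value is illustrative):

    ```python
    from email.message import Message

    # In Python 3, urllib.request.urlopen(url).headers is an
    # email.message.Message, so a real response offers the same method.
    headers = Message()
    headers['Content-Type'] = 'text/html; charset=ISO-8859-1'

    # get_content_charset() returns the charset parameter lower-cased,
    # or None if the header does not declare one.
    charset = headers.get_content_charset() or 'utf-8'  # fallback is a guess
    print(charset)  # iso-8859-1
    ```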