Scraping part of a Wikipedia Infobox

Submitted by 百般思念 on 2021-01-29 03:49:42

Question


I'm using Python 2.7, requests & BeautifulSoup to scrape approximately 50 Wikipedia pages. I've created a column in my dataframe that holds partial URLs, one per song name (these have been verified previously and I get response code 200 when testing against all of them).

My code loops through and appends these individual URLs to the main Wikipedia URL. I've been able to get the heading of the page and other data, but what I really want is the Length of the song only (I don't need everything else). The song length is contained within an infobox (example here: http://en.wikipedia.org/wiki/No_One_Knows).

My code either drags back everything on the page or nothing at all. I think the main problem is the line I have left blank below (i.e. mt = ...): I've put different HTML tags in there, but I either get nothing back or most of the page.

import requests
from bs4 import BeautifulSoup

# df.lengthlink is the dataframe column of partial strings to append
# to the main Wikipedia url
def songlength(xyz):
    url = 'http://en.wikipedia.org/wiki/' + xyz
    resp = requests.get(url)
    page = resp.content
    take = BeautifulSoup(page)
    mt = take.find_all(____________)  # <-- the part I can't get right
    return xyz, mt

for xyz in df.lengthlink:
    print songlength(xyz)

Edited to add: Martijn's suggestion below worked for a single url (i.e. No_One_Knows) but not for my multiple links. It threw up this error:

InvalidSchema                             Traceback (most recent call last)
<ipython-input-166-b5a10522aa27> in <module>()
      2 xyz = df.lengthlink 
      3 url = 'http://en.wikipedia.org/wiki/' + xyz
----> 4 resp = requests.get(url, params={'action': 'raw'})
      5 page = resp.text
      6 

C:\Python27\lib\site-packages\requests\api.pyc in get(url, **kwargs)
     63 
     64     kwargs.setdefault('allow_redirects', True)
---> 65     return request('get', url, **kwargs)
     66 
     67 

C:\Python27\lib\site-packages\requests\api.pyc in request(method, url,    **kwargs)
     47 
     48     session = sessions.Session()
---> 49     response = session.request(method=method, url=url, **kwargs)
     50     # By explicitly closing the session, we avoid leaving sockets open which
     51     # can trigger a ResourceWarning in some cases, and look like a memory leak

C:\Python27\lib\site-packages\requests\sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    459         }
    460         send_kwargs.update(settings)
--> 461         resp = self.send(prep, **send_kwargs)
    462 
    463         return resp

C:\Python27\lib\site-packages\requests\sessions.pyc in send(self, request, **kwargs)
    565 
    566         # Get the appropriate adapter to use
--> 567         adapter = self.get_adapter(url=request.url)
    568 
    569         # Start time (approximately) of the request

C:\Python27\lib\site-packages\requests\sessions.pyc in get_adapter(self, url)
    644 
    645         # Nothing matches :-/
--> 646         raise InvalidSchema("No connection adapters were found for '%s'" % url)
    647 
    648     def close(self):

InvalidSchema: No connection adapters were found for '1     http://en.wikipedia.org/wiki/Locked_Out_of_Heaven
 2     http://en.wikipedia.org/wiki/No_One_Knows
 3     http://en.wikipedia.org/wiki/Given_to_Fly
 4     http://en.wikipedia.org/wiki/Nothing_as_It_Seems  

Name: lengthlink, Length: 50, dtype: object'
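
The error message above shows the cause: `xyz = df.lengthlink` binds the entire pandas Series, so the string concatenation builds one giant "url" out of all 50 rows, which requests rejects with InvalidSchema. A minimal sketch of one fix, requesting one row of the question's df.lengthlink column at a time:

import requests

# Loop over the Series so each request gets a single link,
# instead of concatenating the whole column into one url.
for xyz in df.lengthlink:
    url = 'http://en.wikipedia.org/wiki/' + xyz
    resp = requests.get(url, params={'action': 'raw'})
    page = resp.text
    # parse `page` for the '| Length' line as shown in the answer below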

Answer 1:


Rather than trying to parse the HTML output, parse the raw MediaWiki source for the page; the first line that starts with | Length contains the information you are looking for:

url = 'http://en.wikipedia.org/wiki/' + xyz
resp = requests.get(url, params={'action': 'raw'})
page = resp.text
for line in page.splitlines():
    if line.startswith('| Length'):
        length = line.partition('=')[-1].strip()
        break

Demo:

>>> import requests
>>> xyz = 'No_One_Knows'
>>> url = 'http://en.wikipedia.org/wiki/' + xyz
>>> resp = requests.get(url, params={'action': 'raw'})
>>> page = resp.text
>>> for line in page.splitlines():
...     if line.startswith('| Length'):
...        length = line.partition('=')[-1].strip()
...        break
... 
>>> print length
4:13 <small>(Radio edit)</small><br />4:38 <small>(Album version)</small>

You can further process this to extract the richer data here (the Radio edit vs. the Album version) as required.
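
For example, a minimal sketch of that further processing, assuming the value keeps the m:ss <small>(label)</small> shape shown in the demo (the regex is my own, not part of the original answer):

import re

length = '4:13 <small>(Radio edit)</small><br />4:38 <small>(Album version)</small>'

# Pair each m:ss duration with the label that follows it, giving
# [('4:13', 'Radio edit'), ('4:38', 'Album version')]
durations = re.findall(r'(\d+:\d{2})\s*<small>\(([^)]+)\)</small>', length)
print durations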



Source: https://stackoverflow.com/questions/29725163/scraping-part-of-a-wikipedia-infobox
