Question
I'm using Python 2.7, requests and BeautifulSoup to scrape approximately 50 Wikipedia pages. I've created a column in my dataframe that holds partial URLs relating to the name of each song (these have been verified previously, and I get response code 200 when testing against all of them).
My code loops through and appends each of these partial URLs to the main Wikipedia URL. I've been able to get the heading of the page and other data, but what I really want is just the Length of the song (I don't need anything else). The song length is contained in an infobox (example here: http://en.wikipedia.org/wiki/No_One_Knows).
My code either drags in everything on the page or nothing at all. I think the main problem is the bit I have underlined below (i.e. mt = ...) - I've put different HTML tags in there, but I either get nothing back or most of the page.
xyz = df.lengthlink
# column in the dataframe containing partial strings to append to the main Wikipedia url

def songlength():
    url = ('http://en.wikipedia.org/wiki/' + xyz)
    resp = requests.get(url)
    page = resp.content
    take = BeautifulSoup(page)
    mt = take.find_all(____________)
    sign = mt
    return xyz, sign

for xyz in df.lengthlink:
    print songlength()
Edited to add: Using Martijn's suggestion below worked for a single URL (i.e. No_One_Knows) but not for my multiple links. It threw up this error:
InvalidSchema Traceback (most recent call last)
<ipython-input-166-b5a10522aa27> in <module>()
2 xyz = df.lengthlink
3 url = 'http://en.wikipedia.org/wiki/' + xyz
----> 4 resp = requests.get(url, params={'action': 'raw'})
5 page = resp.text
6
C:\Python27\lib\site-packages\requests\api.pyc in get(url, **kwargs)
63
64 kwargs.setdefault('allow_redirects', True)
---> 65 return request('get', url, **kwargs)
66
67
C:\Python27\lib\site-packages\requests\api.pyc in request(method, url, **kwargs)
47
48 session = sessions.Session()
---> 49 response = session.request(method=method, url=url, **kwargs)
50 # By explicitly closing the session, we avoid leaving sockets open which
51 # can trigger a ResourceWarning in some cases, and look like a memory leak
C:\Python27\lib\site-packages\requests\sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
459 }
460 send_kwargs.update(settings)
--> 461 resp = self.send(prep, **send_kwargs)
462
463 return resp
C:\Python27\lib\site-packages\requests\sessions.pyc in send(self, request, **kwargs)
565
566 # Get the appropriate adapter to use
--> 567 adapter = self.get_adapter(url=request.url)
568
569 # Start time (approximately) of the request
C:\Python27\lib\site-packages\requests\sessions.pyc in get_adapter(self, url)
644
645 # Nothing matches :-/
--> 646 raise InvalidSchema("No connection adapters were found for '%s'" % url)
647
648 def close(self):
InvalidSchema: No connection adapters were found for '1 http://en.wikipedia.org/wiki/Locked_Out_of_Heaven
2 http://en.wikipedia.org/wiki/No_One_Knows
3 http://en.wikipedia.org/wiki/Given_to_Fly
4 http://en.wikipedia.org/wiki/Nothing_as_It_Seems
Name: lengthlink, Length: 50, dtype: object'
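Note: the traceback shows that xyz here is the whole lengthlink column (a pandas Series), so 'http://en.wikipedia.org/wiki/' + xyz builds a Series of 50 URLs and requests.get is handed the string representation of that entire Series, which is why it raises InvalidSchema. A minimal sketch of requesting one page per row instead, assuming df.lengthlink holds one partial title per row:

import requests

# build and request one URL per partial link instead of passing the whole column
for xyz in df.lengthlink:
    url = 'http://en.wikipedia.org/wiki/' + xyz
    resp = requests.get(url, params={'action': 'raw'})
    print xyz, resp.status_code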
Answer 1:
Rather than trying to parse the HTML output, parse the raw MediaWiki source for the page; the first line that starts with | Length contains the information you are looking for:
url = 'http://en.wikipedia.org/wiki/' + xyz
resp = requests.get(url, params={'action': 'raw'})
page = resp.text
for line in page.splitlines():
    if line.startswith('| Length'):
        length = line.partition('=')[-1].strip()
        break
Demo:
>>> import requests
>>> xyz = 'No_One_Knows'
>>> url = 'http://en.wikipedia.org/wiki/' + xyz
>>> resp = requests.get(url, params={'action': 'raw'})
>>> page = resp.text
>>> for line in page.splitlines():
...     if line.startswith('| Length'):
...         length = line.partition('=')[-1].strip()
...         break
...
>>> print length
4:13 <small>(Radio edit)</small><br />4:38 <small>(Album version)</small>
You can further process this to extract the richer data here (the Radio edit vs. the Album version) as required.
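To apply this across all 50 links, here is a minimal sketch that loops over the dataframe column one row at a time (avoiding the Series error from the question's edit) and, as an illustrative assumption, uses a simple regular expression to pull the plain m:ss times out of the wikitext markup:

import re
import requests

lengths = {}
for xyz in df.lengthlink:
    url = 'http://en.wikipedia.org/wiki/' + xyz
    resp = requests.get(url, params={'action': 'raw'})
    for line in resp.text.splitlines():
        if line.startswith('| Length'):
            raw = line.partition('=')[-1].strip()
            # e.g. ['4:13', '4:38'] for the radio edit and album version
            lengths[xyz] = re.findall(r'\d+:\d{2}', raw)
            break

print lengths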
Source: https://stackoverflow.com/questions/29725163/scraping-part-of-a-wikipedia-infobox