Question
I can scrape the page for the headlines, no problem. The URLs are another story. They are fragments that get appended to the end of the base URL - I understand that... What do I need to do to pull the related URLs for storage in the format base_url.scraped_fragment?
from urllib2 import urlopen
import requests
from bs4 import BeautifulSoup
import csv
import MySQLdb
import re
html = urlopen("http://advances.sciencemag.org/")
soup = BeautifulSoup(html.read().decode('utf-8'),"lxml")
#links = soup.findAll("a","href")
headlines = soup.findAll("div", "highwire-cite-title media__headline__title")
for headline in headlines:
    text = headline.get_text()
    print text
Answer 1:
First of all, there should be a space between the class names:
highwire-cite-title media__headline__title
HERE^
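As a side note on how BeautifulSoup handles multi-class attributes: a string with a space is matched against the exact `class` attribute value, while `class_` with a single class name matches any element that carries that class. A minimal sketch (the HTML snippet here is a made-up stand-in for the real page):

```python
from bs4 import BeautifulSoup

# Illustrative markup mimicking the page's headline divs (not scraped live).
html = '<div class="highwire-cite-title media__headline__title">A headline</div>'
soup = BeautifulSoup(html, "html.parser")

# The whitespace-separated string matches the exact class attribute value...
exact = soup.find_all("div", "highwire-cite-title media__headline__title")

# ...while a single class name matches any element carrying that class.
single = soup.find_all("div", class_="highwire-cite-title")

print(len(exact), len(single))
```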
Anyway, since you need the links, you should locate the a elements and use urljoin() to make absolute URLs:
from urlparse import urljoin
import requests
from bs4 import BeautifulSoup
base_url = "http://advances.sciencemag.org"
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")
headlines = soup.find_all(class_="highwire-cite-linked-title")
for headline in headlines:
    print(urljoin(base_url, headline["href"]))
Prints:
http://advances.sciencemag.org/content/2/4/e1600069
http://advances.sciencemag.org/content/2/4/e1501914
http://advances.sciencemag.org/content/2/4/e1501737
...
http://advances.sciencemag.org/content/2/2
http://advances.sciencemag.org/content/2/1
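urljoin() is what produces the base_url + fragment format the question asks for. A quick stdlib-only illustration of its behavior, using Python 3's urllib.parse (on Python 2 the same function lives in the urlparse module, as in the answer's code); the fragment values below are illustrative, not scraped live:

```python
from urllib.parse import urljoin

base_url = "http://advances.sciencemag.org"

# Relative href fragments, as they would appear in the scraped <a> tags.
fragments = ["/content/2/4/e1600069", "/content/2/1"]

# urljoin resolves each fragment against the base URL.
absolute = [urljoin(base_url, fragment) for fragment in fragments]
print(absolute)

# An already-absolute href passes through unchanged.
print(urljoin(base_url, "http://example.com/page"))
```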
Source: https://stackoverflow.com/questions/36800951/scraping-a-page-for-urls-using-beautifulsoup