问题
I want to scrape the href of every project from the website https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1 with Python 3.5 and BeautifulSoup.
That's my code
#Loading Libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup
#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")
#Scraping "Link" (href)
project_ref = soup.findAll('h6', {'class': 'project-title'})
project_href = [project.findChildren('a')[0].href for project in project_ref if project.findChildren('a')]
print(project_href)
I get [None, None, .... None, None] back. I need a list with all the href from the class .
Any ideas?
回答1:
Try something like this:
import urllib.request
from bs4 import BeautifulSoup
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage)
project_href = [i['href'] for i in soup.find_all('a', href=True)]
print(project_href)
This will return all the href
instances. As i see in your link, a lot of href
tags have #
inside them. You can avoid these with a simple regex for proper links, or just ignore the #
symboles.
project_href = [i['href'] for i in soup.find_all('a', href=True) if i['href'] != "#"]
This will still give you some trash links like /discover?ref=nav
, so if you want to narrow it down use a proper regex for the links you need.
EDIT:
To solve the problem you mentioned in the comments:
soup = BeautifulSoup(thepage)
for i in soup.find_all('div', attrs={'class' : 'project-card-content'}):
print(i.a['href'])
来源:https://stackoverflow.com/questions/38570411/how-to-scrape-href-with-python-3-5-and-beautifulsoup