I am trying to extract links from only a certain section of a Blogspot page, but the output shows that the code extracts every link on the page.
Here is the code:
import urlparse
import urllib
from bs4 import BeautifulSoup

url = "http://ellywonderland.blogspot.com/"
urls = [url]
visited = [url]

while len(urls) > 0:
    try:
        htmltext = urllib.urlopen(urls[0]).read()
    except:
        print urls[0]
    soup = BeautifulSoup(htmltext)
    urls.pop(0)
    print len(urls)
    for tags in soup.find_all(attrs={'class': "post-title entry-title"}):
        for tag in soup.findAll('a', href=True):
            tag['href'] = urlparse.urljoin(url, tag['href'])
            if url in tag['href'] and tag['href'] not in visited:
                urls.append(tag['href'])
                visited.append(tag['href'])

print visited
Here is the HTML for the section that I want to extract:
<h3 class="post-title entry-title" itemprop="name">
<a href="http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html">Pre-wedding * Vintage*</a>
Thank you.
If you don't necessarily need to use BeautifulSoup, I think it would be easier to do something like this:
import feedparser

url = feedparser.parse('http://ellywonderland.blogspot.com/feeds/posts/default?alt=rss')
for x in url.entries:
    print str(x.link)
Output:
http://ellywonderland.blogspot.com/2011/03/my-vintage-pre-wedding.html
http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html
http://ellywonderland.blogspot.com/2010/12/tissue-paper-flower-crepe-paper.html
http://ellywonderland.blogspot.com/2010/12/menguap-menurut-islam.html
http://ellywonderland.blogspot.com/2010/12/weddings-idea.html
http://ellywonderland.blogspot.com/2010/12/kawin.html
http://ellywonderland.blogspot.com/2010/11/vitamin-c-collagen.html
http://ellywonderland.blogspot.com/2010/11/port-dickson.html
http://ellywonderland.blogspot.com/2010/11/ellys-world.html
feedparser parses the blog's RSS feed and returns the data you want, in this case the href of each post title.
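To show what kind of data the feed approach relies on, here is a rough sketch that pulls the per-post links out of an RSS 2.0 document using only the standard library. This is not feedparser and not how feedparser works internally; it is a stand-in for illustration. It assumes Python 3, and SAMPLE_RSS and item_links are made-up names, with the sample feed trimmed down by hand from the URLs listed above.

```python
import xml.etree.ElementTree as ET

# Hand-written RSS 2.0 sample (an assumption for the demo); a real Blogger
# feed carries many more fields per item.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <link>http://ellywonderland.blogspot.com/</link>
    <item>
      <link>http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html</link>
    </item>
    <item>
      <link>http://ellywonderland.blogspot.com/2010/12/kawin.html</link>
    </item>
  </channel>
</rss>
"""


def item_links(rss_text):
    """Return the <link> text of every <item> in an RSS 2.0 string."""
    root = ET.fromstring(rss_text)
    # Only <item> elements are visited, so the channel-level <link> is skipped.
    return [item.findtext("link") for item in root.iter("item")]


for link in item_links(SAMPLE_RSS):
    print(link)
```

Because each item in the feed corresponds to one post, there is no need to filter out sidebar or navigation links the way you would when scraping the HTML page.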
You need to call .get on the tag object:
print Objecta.get('href')
Example from http://www.crummy.com/software/BeautifulSoup/bs4/doc/:
for link in soup.find_all('a'):
    print(link.get('href'))
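Building on that, one likely reason the code in the question returns every link is that its inner loop calls soup.findAll, which searches the whole document, rather than searching inside each matched heading. Here is a minimal sketch of the scoped version, assuming Python 3 and bs4 with the built-in "html.parser". SAMPLE_HTML and post_title_links are hypothetical names, and the sidebar link in the sample is invented so there is something to exclude.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Sample markup modelled on the snippet in the question; the sidebar link
# is invented for this demo.
SAMPLE_HTML = """
<div class="sidebar"><a href="/p/about.html">About</a></div>
<h3 class="post-title entry-title" itemprop="name">
<a href="http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html">Pre-wedding * Vintage*</a>
</h3>
"""


def post_title_links(html, base_url):
    """Return absolute hrefs of links found inside post-title headings only."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # class_ with a single name matches any tag whose class list contains it.
    for heading in soup.find_all("h3", class_="post-title"):
        # Search within this heading, not the whole soup.
        for a in heading.find_all("a", href=True):
            links.append(urljoin(base_url, a["href"]))
    return links


print(post_title_links(SAMPLE_HTML, "http://ellywonderland.blogspot.com/"))
```

The key change is heading.find_all instead of soup.find_all in the inner loop, so the sidebar link never appears in the result.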
Source: https://stackoverflow.com/questions/30992225/extract-links-for-certain-section-only-from-blogspot-using-beautifulsoup