Question
I am trying to build a web crawler that collects people's interests. Here is the code:
    import requests
    from bs4 import BeautifulSoup

    def facebook_spider():
        url = 'https://www.facebook.com/abhas.mittal7'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        # Print the text of every div with the target class
        for div in soup.find_all('div', attrs={'class': 'mediaRowWrapper'}):
            print(div.text)

    facebook_spider()
It does not show any results. However, if I use a different div class (one of the divs at the top of the page), it prints the content. I think there is some problem with the nested divs, but I tried this code on a sample HTML page with many nested divs and it worked. Kindly help.
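The sample-page test described above can be reproduced offline. The sketch below is only an illustration: the markup in SAMPLE_HTML, the helper name texts_for_class and the class notInThePage are made up for the example, and it says nothing about what the real page returns. It shows that a class lookup finds a div however deeply it is nested, and that a class which is absent from the parsed text gives an empty result, the same symptom as in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the target div sits four levels deep.
SAMPLE_HTML = """
<div class="outer">
  <div class="middle">
    <div class="inner">
      <div class="mediaRowWrapper">Music: Example Band</div>
    </div>
  </div>
</div>
"""


def texts_for_class(html, class_name):
    """Return the stripped text of every div carrying class_name."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True)
            for div in soup.find_all("div", attrs={"class": class_name})]


# Nesting depth does not matter: the deeply nested div is found.
nested_hits = texts_for_class(SAMPLE_HTML, "mediaRowWrapper")
print(nested_hits)

# A class that never appears in the markup yields an empty list,
# which is the same symptom as in the question.
missing_hits = texts_for_class(SAMPLE_HTML, "notInThePage")
print(missing_hits)

# A plain substring test tells whether the class is in the text at all.
print("mediaRowWrapper" in SAMPLE_HTML)
```

Running the same substring test on plain_text from the spider would show whether the downloaded HTML contains the class at all before any parsing happens.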
Answer 1:
See if this works:
    from urllib.request import urlopen  # Python 3; the original used Python 2's urllib.urlopen
    from bs4 import BeautifulSoup

    url = 'https://www.facebook.com/abhas.mittal7'
    htmltext = urlopen(url).read()

    def gettext(htmltext):
        soup = BeautifulSoup(htmltext, "html.parser")
        # Remove scripts and styles so only the page text remains
        for script in soup(["script", "style"]):
            script.extract()
        text = soup.get_text()
        # Strip leading and trailing whitespace from each line
        lines = (line.strip() for line in text.splitlines())
        # Split each line on double spaces so multi-part lines become separate chunks
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        # Drop empty chunks and join the rest, one per line
        return '\n'.join(chunk for chunk in chunks if chunk)

    print(gettext(htmltext))  # or write it to a file, whatever you see fit
Source: https://stackoverflow.com/questions/32309945/web-crawler-not-working-in-nested-divs