How to scrape Instagram with BeautifulSoup

梦想与她 提交于 2021-01-16 08:00:12

问题


I want to scrape pictures from a public Instagram account. I'm pretty familiar with bs4 so I started with that. Using the element inspector on Chrome, I noted the pictures are in an unordered list and li has class 'photo', so I figure, what the hell -- can't be that hard to scrape with findAll, right?

Wrong: it doesn't return anything (code below) and I soon notice that the code shown in element inspector and the code that I drew from requests were not the same AKA no unordered list in the code I pulled from requests.

Any idea how I can get the code that shows up in element inspector?

Just for the record, this was my code to start, which didn't work because the unordered list was not there:

from bs4 import BeautifulSoup
import requests
import re

r = requests.get('http://instagram.com/umnpics/')
soup = BeautifulSoup(r.text)
for x in soup.findAll('li', {'class':'photo'}):
    print x

Thank you for your help.


回答1:


If you look at the source code for the page, you'll see that some javascript generates the webpage. What you see in the element browser is the webpage after the script has been run, and beautifulsoup just gets the html file. In order to parse the rendered webpage you'll need to use something like Selenium to render the webpage for you.

So, for example, this is how it would look with Selenium:

from bs4 import BeautifulSoup
import selenium.webdriver as webdriver

url = 'http://instagram.com/umnpics/'
driver = webdriver.Firefox()
driver.get(url)

soup = BeautifulSoup(driver.page_source)

for x in soup.findAll('li', {'class':'photo'}):
    print x

Now the soup should be what you are expecting.



来源:https://stackoverflow.com/questions/18130499/how-to-scrape-instagram-with-beautifulsoup

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!