Scraping Instagram with BeautifulSoup

你离开我真会死。 提交于 2019-12-25 01:34:42

问题


I'm trying to get a particular string from the "search by tag" in Instagram. I'd like to get the url img from here:

<img alt="#yeşil  #manzara #doğa  
#yayla #nature #naturelovers #adventuretime #adventures #mountainstaries 
#picture #şehirdenuzak  #tatil #holiday #cow  #potography #view #kütükev 
#naturelife #animal #amazing  #kar #winter #winteriscomming #mapavr1 #artvin 
#tulumile #insaatr #tulumci #rize 
class="_2di5p" sizes="171px" srcset="https://scontent-mxp11.cdninstagram.com/vp/c883e0c4267c003843fafeda255f1329/5A9D3C97/t51.2885-15/s150x150/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg 150w,
https://scontent-mxp1-1.cdninstagram.com/vp/6a3480f8658b50c691bcc100a96cc6f0/5A9CC9DC/t51.2885-15/s240x240/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg 240w,
https://scontent-mxp1-1.cdninstagram.com/vp/461c138e15f52420c3fbc075fab027eb/5A9DD808/t51.2885-15/s320x320/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg 320w,
https://scontent-mxp1-1.cdninstagram.com/vp/ad5d67f1c9ea77d78d145501e73c2ea0/5A9CAF9D/t51.2885-15/s480x480/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg 480w,
https://scontent-mxp1-1.cdninstagram.com/vp/e0636f79adc1ae53f7321d10fe60f275/5A9CD134/t51.2885-15/s640x640/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg 640w" 
src="https://scontent-mxp1-1.cdninstagram.com/vp/e0636f79adc1ae53f7321d10fe60f275/5A9CD134/t51.2885-15/s640x640/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg" style="">

so basically I wolud like to get this string (That is the one with 240w at the end):

https://scontent-mxp1-1.cdninstagram.com/vp/6a3480f8658b50c691bcc100a96cc6f0/../n.jpg

and I tried writing this code with Python but it doesn't work

import requests
from bs4 import BeautifulSoup

request = requests.get("https://www.instagram.com/explore/tags/nature/")
content = request.content
soup = BeautifulSoup(content,"html.parser")
element = soup.find("srcset")
print(element.text.strip())

maybe the real problem is that there are 21 elements like this one in the page but to start I'd like to understand how to get that string.

(And, if any of you know a good tutorial or book for bs4 can you tell me?)


回答1:


The reason you can't see any output is that the images are added dynamically to the page source using JavaScript. So, the HTML that you've provided isn't available in the page source. Easiest way to overcome this is to use Selenium.

But, there's one more way to scrape that. Looking at the page source, the data you're after, is available in a <script> tag in the form of JSON. The relevant data is in the form of:

"thumbnail_resources": [
    {
        "src": "https://instagram.fpnq3-1.fna.fbcdn.net/vp/a3ed0ee1af581f1c1fe6170b8c080e7c/5B2CA660/t51.2885-15/s150x150/e35/28433503_571483933190064_5347634166450094080_n.jpg",
         "config_width": 150,
         "config_height": 150
     },
     {
         "src": "https://instagram.fpnq3-1.fna.fbcdn.net/vp/7a0bb4fb1b5d5e3b179c58a2b9472b9f/5B2C535F/t51.2885-15/s240x240/e35/28433503_571483933190064_5347634166450094080_n.jpg",
         "config_width": 240,
         "config_height": 240
     },

To get the JSON, you can use this (code taken from this answer):

script = soup.find('script', text=lambda t: t.startswith('window._sharedData'))
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)

Code to get image link for all the images:

import json
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.instagram.com/explore/tags/nature/')
soup = BeautifulSoup(r.text, 'lxml')

script = soup.find('script', text=lambda t: t.startswith('window._sharedData'))
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)

for post in data['entry_data']['TagPage'][0]['graphql']['hashtag']['edge_hashtag_to_media']['edges']:
    image_src = post['node']['thumbnail_resources'][1]['src']
    print(image_src)

Partial output:

https://instagram.fpnq3-1.fna.fbcdn.net/vp/e8a78407fb61de834cad7f10eca830fc/5A9DC375/t51.2885-15/s240x240/e15/c0.80.640.640/28766397_174603559842180_1092148752455565312_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/3a20f36647c86c2196f259b5d14ebf82/5A9D5BC9/t51.2885-15/s240x240/e15/28433802_283862648812409_3322859933120069632_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/82216be4596dd9da862ba267cdeab517/5B144226/t51.2885-15/s240x240/e35/c0.135.1080.1080/28157436_941679549319762_5605299824451649536_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/e50eab90b2e0951d67922e49b495e1fc/5B3EC9B8/t51.2885-15/s240x240/e35/c135.0.810.810/28754107_179533402825352_1137703808411893760_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/d3a13e7b81a65421b4318b57fb8ee24e/5B4D9EFF/t51.2885-15/s240x240/e35/28433583_375555202918683_1951892035636035584_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/1b0aeea1b9be983498192d350e039aa0/5B43C583/t51.2885-15/s240x240/e35/28156427_154249191953160_9219472301039288320_n.jpg
...

Note: The [1] in the line image_src = post['node']['thumbnail_resources'][1]['src'] is for 240w. You can use 0, 1, 2, 3 or 4 for 150w, 240w, 320w, 480w or 640w respectively. Also, if you want any other data regarding any image, like, number of likes, comments, caption, etc; everything is available in this JSON (data variable).



来源:https://stackoverflow.com/questions/49085421/scraping-instagram-with-beautifulsoup

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!