Get Instagram followers

拟墨画扇 submitted on 2019-12-18 07:13:15

Question


I want to parse the follower count from a website with BeautifulSoup. This is what I have so far:

import requests
from bs4 import BeautifulSoup

username_extract = 'lazada_my'

url = 'https://www.instagram.com/' + username_extract
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
f = soup.find('head', attrs={'class': 'count'})

This is the part I want to parse:

Something within my soup.find() function is wrong, but I can't wrap my head around it. When returning f, it is empty. Any idea what I am doing wrong?


Answer 1:


I think you can use the re module to search the response for the count.

import requests
import re

username_extract = 'lazada_my'

url = 'https://www.instagram.com/' + username_extract
r = requests.get(url)
# r.text is the decoded body; the capture group holds the digits of the count
m = re.search(r'"followed_by":\{"count":([0-9]+)\}', r.text)
print(m.group(1))
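
Note that re.search returns None when the pattern is not found, so m.group(1) would raise AttributeError on a page that lacks the count. A minimal sketch of a guarded version follows. The helper name extract_follower_count and the sample string are hypothetical, and it assumes the page still embeds "followed_by":{"count":N} as it did when this was written.

```python
import re

# Pattern from the answer above: captures the digits in "followed_by":{"count":N}
FOLLOWED_BY = re.compile(r'"followed_by":\{"count":([0-9]+)\}')


def extract_follower_count(html):
    """Return the follower count as an int, or None if the pattern is absent."""
    m = FOLLOWED_BY.search(html)
    return int(m.group(1)) if m else None


# Hypothetical sample standing in for r.text
sample = '{"user":{"followed_by":{"count":407426},"follows":{"count":27}}}'
print(extract_follower_count(sample))
print(extract_follower_count('<html></html>'))
```

Returning None instead of raising lets the caller decide how to treat profiles where the count could not be found.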



Answer 2:


soup.find('head', attrs={'class':'count'}) searches for something that looks like <head class="count">, which doesn't exist anywhere in the HTML. The data you're after is contained in the <script> tag that starts with window._sharedData:

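# t is None for <script> tags without a single text child, so guard before calling startswith
script = soup.find('script', text=lambda t: t and t.startswith('window._sharedData'))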

From there, you can just strip off the variable assignment and the semicolon to get valid JSON:

# <script>window._sharedData = ...;</script>
#                              ^^^
#                              JSON

page_json = script.text.split(' = ', 1)[1].rstrip(';')

Parse it and everything you need is contained in the object:

import json

data = json.loads(page_json)
follower_count = data['entry_data']['ProfilePage'][0]['user']['followed_by']['count']
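
The strip-and-parse steps can be tried without a network request. The sketch below runs them on a hypothetical script body shaped like the structure described above (the real one is far larger), and adds a hypothetical helper, dig, that returns None rather than raising if a key along the path is missing.

```python
import json

# Hypothetical script body, shaped like the structure the answer describes
script_text = ('window._sharedData = {"entry_data": {"ProfilePage": '
               '[{"user": {"followed_by": {"count": 407426}}}]}};')

# Drop everything up to the first " = ", then the trailing semicolon
page_json = script_text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)


def dig(obj, *keys):
    """Follow keys and indices in turn; return None as soon as one is missing."""
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return None
    return obj


follower_count = dig(data, 'entry_data', 'ProfilePage', 0,
                     'user', 'followed_by', 'count')
print(follower_count)
```

The guarded lookup is a design choice, not part of the original answer: the chained indexing fails with a KeyError as soon as the page structure changes, while dig reports the missing value as None.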



Answer 3:


Most of the content is dynamically generated with JS. That's the reason you're getting empty results.

However, the follower count is present in the page source, just not in the form you want. You can see it in the description meta tag:

<meta content="407.4k Followers, 27 Following, 2,740 Posts - See Instagram photos and videos from Lazada Malaysia (@lazada_my)" name="description" />

If you want to scrape the followers count without regex, you can use this:

>>> followers = soup.find('meta', {'name': 'description'})['content']
>>> followers
'407.4k Followers, 27 Following, 2,740 Posts - See Instagram photos and videos from Lazada Malaysia (@lazada_my)'
>>> followers_count = followers.split('Followers')[0]
>>> followers_count
'407.4k '
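
The result is still a string such as '407.4k ', and it is a rounded figure rather than the exact count. If a number is needed, a sketch like the following could convert it. The function name parse_count is hypothetical, and the "m" suffix is an assumption, since only "k" appears in the output above.

```python
from decimal import Decimal

# Assumed suffix multipliers; only "k" is confirmed by the example above
MULTIPLIERS = {'k': 1_000, 'm': 1_000_000}


def parse_count(text):
    """Convert strings like '407.4k ' or '2,740' to an int."""
    cleaned = text.strip().lower().replace(',', '')
    factor = 1
    if cleaned and cleaned[-1] in MULTIPLIERS:
        factor = MULTIPLIERS[cleaned[-1]]
        cleaned = cleaned[:-1]
    # Decimal avoids binary floating-point error when scaling 407.4 by 1000
    return int(Decimal(cleaned) * factor)


print(parse_count('407.4k '))
print(parse_count('2,740'))
```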



Answer 4:


You have to go through the script tags and check whether 'window._sharedData' exists in each one. If it does, apply the regular expression to that script.

import re

import requests
from bs4 import BeautifulSoup

username_extract = 'lazada_my'
url = 'https://www.instagram.com/' + username_extract
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
s = re.compile(r'"followed_by":\{"count":\d+\}')
for i in soup.find_all('script'):
    # Only the script that defines window._sharedData carries the profile data
    if 'window._sharedData' in str(i):
        print(s.search(str(i)).group())

Result:

"followed_by":{"count":407426}



Answer 5:


Thank you all, I ended up using William's solution. In case anybody needs it for a future project, here is my complete code for scraping the follower count from a batch of URLs:

import csv
import re

import pandas as pd
import requests

insta = pd.read_csv('Instagram.csv')

username = []
bad_urls = []

# Each entry is a profile URL; the username is the fourth "/"-separated part
for lines in insta['Instagram'][0:250]:
    lines = lines.split("/")
    username.append(lines[3])

with open('insta_output.csv', 'w', newline='') as csvfile:
    t = csv.writer(csvfile, delimiter=',')  # comma-separated output
    for user in username:
        url = 'https://www.instagram.com/' + user
        try:
            r = requests.get(url)
            m = re.search(r'"followed_by":\{"count":([0-9]+)\}', r.text)
            num_followers = m.group(1)
            t.writerow([user, num_followers])  # one row per profile
        except (requests.RequestException, AttributeError):
            # The request failed or the pattern was not found in the page
            bad_urls.append(url)
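
Taking lines[3] after splitting on "/" works only when every URL has the exact form https://www.instagram.com/name. A slightly more tolerant sketch uses urllib.parse from the standard library. The function name username_from_url is hypothetical, and it assumes each URL includes its scheme (https://).

```python
from urllib.parse import urlparse


def username_from_url(url):
    """Return the first path segment of a profile URL, or None if there is none."""
    path = urlparse(url).path
    segments = [s for s in path.split('/') if s]
    return segments[0] if segments else None


print(username_from_url('https://www.instagram.com/lazada_my/'))
print(username_from_url('https://www.instagram.com/lazada_my?hl=en'))
```

This ignores trailing slashes and query strings, and returns None for a URL without a username instead of raising IndexError.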


Source: https://stackoverflow.com/questions/49043857/get-instagram-followers
