Using Python, Selenium, and BeautifulSoup to scrape for content of a tag?

别说谁变了你拦得住时间么, submitted on 2019-12-08 13:27:17

Question


I'm a relative beginner. There are similar topics to this one, but I can see how my own solution works; I just need help connecting the last few dots. I'd like to scrape follower counts from Instagram without using the API. Here's what I have so far:

Python 3.7.0
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()

> DevTools listening on ws://.......

driver.get("https://www.instagram.com/cocacola")
soup = BeautifulSoup(driver.page_source)
elements = soup.find_all(attrs={"class":"g47SY "})
# Note: the full class is 'g47SY lOXF2', but I can't get that to match
for element in elements:
    print(element)

>[<span class="g47SY ">667</span>,
  <span class="g47SY " title="2,598,456">2.5m</span>, # Need what's in title, 2,598,456
  <span class="g47SY ">582</span>]

for element in elements:
    t = element.get('title')
    if t:
        count = t
        count = count.replace(",","")
    else:
        pass

print(int(count))

>2598456 # Success

Is there an easier or quicker way to get to the 2,598,456 number? My original hope was to select on the full class 'g47SY lOXF2', but as far as I'm aware a class value containing spaces doesn't work in BS4. I just want to make sure this code is succinct and functional.


Answer 1:


I had to use the headless option and add executable_path for testing. You can remove both.

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path="chromedriver.exe", options=options)

driver.get('https://www.instagram.com/cocacola')

soup = BeautifulSoup(driver.page_source,'lxml')

# 'span[title]' alone matches every span that has a title attribute, which can give multiple results.
# The follower count sits in a span directly inside an <a> tag, so narrow the selector to 'a > span[title]'.
followers = soup.select_one('a > span[title]')['title'].replace(',', '')

print(followers)
# Output: 2598552
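
The question also asks about matching on the full class 'g47SY lOXF2'. As a minimal sketch of one way to do that, a CSS selector can chain the two class names instead of treating the class value as a single string. The markup below is a hypothetical static sample modelled on the spans printed in the question (the href values and list structure are assumptions, not Instagram's real page), and follower_count is an illustrative helper name:

```python
from bs4 import BeautifulSoup

# Hypothetical static markup modelled on the spans printed in the question.
SAMPLE_HTML = """
<ul>
  <li><span class="g47SY lOXF2">667</span> posts</li>
  <li><a href="/cocacola/followers/"><span class="g47SY lOXF2" title="2,598,456">2.5m</span> followers</a></li>
  <li><a href="/cocacola/following/"><span class="g47SY lOXF2">582</span> following</a></li>
</ul>
"""


def follower_count(html):
    """Return the follower count as an int, or None if no titled span is found."""
    soup = BeautifulSoup(html, "html.parser")
    # Chained class selectors match elements that carry both classes;
    # [title] keeps only the span that has a title attribute.
    span = soup.select_one("span.g47SY.lOXF2[title]")
    if span is None:
        return None
    return int(span["title"].replace(",", ""))


print(follower_count(SAMPLE_HTML))
```

Class names like these are generated and may change at any time, so a selector based on structure, such as 'a > span[title]', is likely to stay valid longer.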



Answer 2:


You could use a regular expression to get the number. Try this:

import re

# Capture only the digits and commas inside title="..."
followerRegex = re.compile(r'title="(\d[\d,]*)"')
followerCount = followerRegex.search(str(elements))
result = followerCount.group(1).replace(',', '')
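
For a version that runs without a browser session, here is a minimal self-contained sketch of the regex approach. It assumes the markup looks like the output printed in the question; the sample string and the helper name count_from_markup are illustrative. It returns None rather than raising when no title attribute is present:

```python
import re

# Matches title="2,598,456" and captures only the digits-and-commas part.
TITLE_COUNT = re.compile(r'title="(\d[\d,]*)"')


def count_from_markup(markup):
    """Return the first title count found in markup as an int, or None if absent."""
    match = TITLE_COUNT.search(markup)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# Sample modelled on the printed list of spans from the question.
sample = (
    '[<span class="g47SY ">667</span>, '
    '<span class="g47SY " title="2,598,456">2.5m</span>, '
    '<span class="g47SY ">582</span>]'
)
print(count_from_markup(sample))
```

Using a capture group avoids trimming the prefix afterwards with str.strip, which removes a set of characters from both ends rather than a literal prefix.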


Source: https://stackoverflow.com/questions/51868356/using-python-selenium-and-beautifulsoup-to-scrape-for-content-of-a-tag
