Using Python, Selenium, and BeautifulSoup to scrape for content of a tag?

Question

I'm a relative beginner. There are similar topics to this one, but I can see how my solution works; I just need help connecting the last few dots. I'd like to scrape follower counts from Instagram without using the API. Here's what I have so far:

Python 3.7.0
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()

> DevTools listening on ws://.......

driver.get("https://www.instagram.com/cocacola")
soup = BeautifulSoup(driver.page_source)
elements = soup.find_all(attrs={"class":"g47SY "}) 
# Note the full class is 'g47SY lOXF2' but I can't get this to work
for element in elements:
    print(element)

>[<span class="g47SY ">667</span>,
  <span class="g47SY " title="2,598,456">2.5m</span>, # Need what's in title, 2,598,456
  <span class="g47SY ">582</span>]

for element in elements:
    t = element.get('title')
    if t:
        count = t
        count = count.replace(",","")
    else:
        pass

print(int(count))

>2598456 # Success

Is there an easier or quicker way to get to the 2,598,456 number? My original hope was that I could just use the class 'g47SY lOXF2', but as far as I'm aware BS4 can't match a class name that contains spaces. I just want to make sure this code is succinct and functional.

Answer 1:

I had to use the headless option and add executable_path for testing. You can remove both.

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
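# options= replaces the deprecated chrome_options= keyword.
# executable_path is only accepted by older Selenium releases; drop it if chromedriver is on PATH.
driver = webdriver.Chrome(executable_path="chromedriver.exe", options=options)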

driver.get('https://www.instagram.com/cocacola')

soup = BeautifulSoup(driver.page_source,'lxml')

#Selecting span[title] alone can match more than one element.
#The follower count sits in a span directly inside an <a> tag, so narrow the selector to that.
followers = soup.select_one('a > span[title]')['title'].replace(',','')

print(followers)
#Output 2598552
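
The selector can also be tried without launching a browser. Below is a minimal sketch, assuming markup shaped like the spans printed in the question. The SAMPLE_HTML string and the follower_count helper are made up for illustration, and Instagram's real markup and class names may differ or change. It also shows that chaining class names in a CSS selector matches an element carrying both classes, which may help with the 'g47SY lOXF2' case from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the output shown in the question.
SAMPLE_HTML = """
<ul>
  <li><span><span class="g47SY lOXF2">667</span> posts</span></li>
  <li><a href="/cocacola/followers/"><span class="g47SY lOXF2" title="2,598,456">2.5m</span> followers</a></li>
  <li><a href="/cocacola/following/"><span class="g47SY lOXF2">582</span> following</a></li>
</ul>
"""


def follower_count(html):
    """Return the follower count as an int, or None if no titled span is found."""
    soup = BeautifulSoup(html, "html.parser")
    # Same selector as the answer: a span with a title attribute directly inside <a>.
    span = soup.select_one("a > span[title]")
    if span is None:
        return None
    return int(span["title"].replace(",", ""))


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# "span.g47SY.lOXF2" matches spans that carry both classes.
both_classes = soup.select("span.g47SY.lOXF2")

print(follower_count(SAMPLE_HTML))
print(len(both_classes))
```

Returning None when the span is missing keeps the caller from hitting a TypeError if the page layout changes or the page has not finished rendering.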

Answer 2:

You could use a regular expression to get the number. Try this:

import re

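# Capture the digits and commas inside the title attribute
followerRegex = re.compile(r'title="([\d,]+)"')
followerCount = followerRegex.search(str(elements))
# group(1) is the captured number; drop the thousands separators
result = followerCount.group(1).replace(',','')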
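
The regular-expression approach can be checked on its own with a plain string. This is a minimal sketch, assuming input that looks like str(elements) from the question; SAMPLE, TITLE_RE, and count_from_title are hypothetical names introduced here. Using a capture group avoids relying on str.strip, which removes a set of characters from both ends rather than a literal prefix.

```python
import re

# Hypothetical string modeled on str(elements) from the question.
SAMPLE = (
    '[<span class="g47SY ">667</span>, '
    '<span class="g47SY " title="2,598,456">2.5m</span>, '
    '<span class="g47SY ">582</span>]'
)

# Capture the run of digits and commas inside title="...".
TITLE_RE = re.compile(r'title="([\d,]+)"')


def count_from_title(text):
    """Return the first title="..." number as an int, or None when absent."""
    match = TITLE_RE.search(text)
    if match is None:
        return None
    # group(1) holds only the captured number, e.g. "2,598,456".
    return int(match.group(1).replace(",", ""))


print(count_from_title(SAMPLE))
print(count_from_title("<span>667</span>"))
```

A regex works on this small, predictable snippet, but it is brittle against attribute reordering or quoting changes, so an HTML parser is usually the safer choice for whole pages.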

Source: https://stackoverflow.com/questions/51868356/using-python-selenium-and-beautifulsoup-to-scrape-for-content-of-a-tag

Tags

python-3.x

beautifulsoup

selenium-chromedriver