BeautifulSoup: Scraping different data sets having same set of attributes in the source code

Submitted by 浪子不回头ぞ on 2019-12-24 23:24:34

Question


I'm using the BeautifulSoup module to scrape the total number of followers and the total number of tweets from a Twitter account. However, when I inspected the elements for these fields on the web page, I found that both values are enclosed in the same set of HTML attributes:

Followers

<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav u-textUserColor" data-nav="followers" href="/IAmJericho/followers" data-original-title="2,469,681 Followers">
          <span class="ProfileNav-label">Followers</span>
          <span class="ProfileNav-value" data-is-compact="true">2.47M</span>
</a>

Tweet count

<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav" data-nav="tweets" tabindex="0" data-original-title="21,769 Tweets">
          <span class="ProfileNav-label">Tweets</span>
          <span class="ProfileNav-value" data-is-compact="true">21.8K</span>
</a>

The mining script that I wrote:

import urllib2
from bs4 import BeautifulSoup

# Fetch the profile page and parse it
link = "https://twitter.com/iamjericho"
r = urllib2.urlopen(link)
src = r.read()
res = BeautifulSoup(src)

# Look at every span marked data-is-compact="true"
followers = ''
for e in res.findAll('span', {'data-is-compact':'true'}):
    followers = e.text

print followers

However, since both values, the total tweet count and the total number of followers, are enclosed in the same set of HTML attributes, i.e. a span tag with class="ProfileNav-value" and data-is-compact="true", running the above script returns only the total number of followers.

How can I extract two pieces of information that are enclosed in similar HTML attributes using BeautifulSoup?


Answer 1:


In this case, one way to do it is to check that data-is-compact="true" appears exactly twice on the page, once for each value you want to extract. Since you also know that the tweet count comes first and the follower count second, you can keep a list of those labels in the same order and use zip to pair each label with its value, so that both are printed together:

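# Python 2 code (urllib2 and the print statement)
import urllib2
from bs4 import BeautifulSoup

# Labels in the same order as the matching spans appear on the page
profile = ['Tweets', 'Followers']

link = "https://twitter.com/iamjericho"
r = urllib2.urlopen(link)
src = r.read()
res = BeautifulSoup(src, 'html.parser')

# Pair each label with the span found at the same position
for p, d in zip(profile, res.find_all('span', {'data-is-compact': 'true'})):
    print p, d.text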

It yields:

Tweets 21,8K
Followers 2,47M
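
If you would rather not depend on the order of the matches, another option is to select each value through the data-nav attribute of its parent <a> tag, which differs between the two snippets in the question ("followers" versus "tweets"). The following is only a minimal sketch in Python 3. It assumes the markup is exactly what the question quotes (the class lists are shortened here), it parses an inline copy of that HTML instead of fetching the live page, and extract_stat is a hypothetical helper name, not part of BeautifulSoup. It also reads the exact count from data-original-title, assuming that tooltip always starts with the number.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two anchors quoted in the question (assumption:
# the real page uses the same data-nav and data-original-title attributes).
SAMPLE_HTML = """
<a class="ProfileNav-stat" data-nav="followers" href="/IAmJericho/followers" data-original-title="2,469,681 Followers">
  <span class="ProfileNav-label">Followers</span>
  <span class="ProfileNav-value" data-is-compact="true">2.47M</span>
</a>
<a class="ProfileNav-stat" data-nav="tweets" tabindex="0" data-original-title="21,769 Tweets">
  <span class="ProfileNav-label">Tweets</span>
  <span class="ProfileNav-value" data-is-compact="true">21.8K</span>
</a>
"""


def extract_stat(soup, nav_name):
    """Hypothetical helper: return (compact_text, exact_count) or None."""
    # Find the anchor whose data-nav attribute names the stat we want.
    anchor = soup.find("a", attrs={"data-nav": nav_name})
    if anchor is None:
        return None
    value = anchor.find("span", class_="ProfileNav-value")
    if value is None:
        return None
    compact = value.get_text(strip=True)
    # The tooltip looks like "2,469,681 Followers"; keep the leading number.
    title = anchor.get("data-original-title", "")
    digits = title.split(" ")[0].replace(",", "")
    exact = int(digits) if digits.isdigit() else None
    return compact, exact


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
stats = {name: extract_stat(soup, name) for name in ("tweets", "followers")}
for name, stat in stats.items():
    print(name, stat)
```

Because each value is looked up by name, the result stays correct even if the page reorders the two counters or adds a third span with data-is-compact="true".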


Source: https://stackoverflow.com/questions/31225803/beautifulsoup-scraping-different-data-sets-having-same-set-of-attributes-in-the
