Find all span styles with a font size larger than the most common one using Beautiful Soup in Python

Submitted by 元气小坏坏 on 2019-12-02 11:22:37

This may help. It collects the font size of every styled span and counts how often each size occurs:

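    from collections import Counter
    import re

    from bs4 import BeautifulSoup

    # Sample markup (illustrative only); replace with your own page source
    html = '<span style="font-size:12px">a</span><span style="font-size:14px">b</span>'
    soup = BeautifulSoup(html, "html.parser")

    used_font_sizes = []  # every font size found, in document order

    # Find all spans that carry a style attribute
    for span in soup.find_all('span', style=True):
        # Skip styles that do not declare a pixel font size
        font_size = re.findall(r"font-size:\s*(\d+)px", span['style'])
        if font_size:
            used_font_sizes.append(int(font_size[0]))

    # Count how often each font size occurs
    count = Counter(used_font_sizes)
    # Print each font size with its number of occurrences, most common first
    print(count.most_common())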
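
The snippet above stops at counting the sizes. A minimal sketch of the remaining step, filtering for sizes larger than the most common one, could look like the following. It uses only the standard library and works on plain (style, text) pairs; the helper names `font_size_px` and `larger_than_most_common` and the sample data are made up for illustration, and it assumes every size is given in px.

```python
import re
from collections import Counter

# Matches integer or decimal pixel sizes, e.g. "font-size:12px" or "font-size: 12.5px"
FONT_SIZE_RE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)px")


def font_size_px(style):
    """Return the pixel font size declared in an inline style string, or None."""
    match = FONT_SIZE_RE.search(style)
    return float(match.group(1)) if match else None


def larger_than_most_common(styled_items):
    """Take (style, text) pairs and return (most_common_size, larger_items).

    larger_items holds the (size, text) pairs whose font size is strictly
    larger than the most common size; pairs without a px size are ignored.
    """
    sized = [(font_size_px(style), text) for style, text in styled_items]
    sized = [(size, text) for size, text in sized if size is not None]
    if not sized:
        return None, []
    most_common_size = Counter(size for size, _ in sized).most_common(1)[0][0]
    return most_common_size, [(s, t) for s, t in sized if s > most_common_size]


# Hypothetical sample data standing in for (span['style'], span.text) pairs
sample = [
    ("font-family: ArialMT; font-size:12px", "1"),
    ("font-family: ArialMT; font-size:14px", "2"),
    ("font-family: ArialMT; font-size:1px", "3"),
    ("font-family: Arial; font-size:12px", "4"),
    ("font-family: ArialMT; font-size:18px", "5"),
    ("color: red", "no size"),
]

common, larger = larger_than_most_common(sample)
print(common, larger)
```

With Beautiful Soup, the pairs could be built as `[(span['style'], span.text) for span in soup.find_all('span', style=True)]`.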

To find all the spans whose font size is larger than that of the most common span style using BeautifulSoup, you need to parse each CSS style that is returned.

Parsing CSS is better done with a library such as cssutils, which lets you access the fontSize attribute of a parsed style directly.

The attribute holds a string such as 12px, and plain string sorting does not order such values numerically (1px, for example, would sort after 18px). To get around this, you could use a library such as natsort.
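
To see why plain sorting is not enough, here is a small sketch using only the standard library. The `px_value` helper is a hypothetical stand-in for what natsort does for you, and it assumes every value ends in px.

```python
sizes = ["12px", "14px", "1px", "18px", "15px"]

# Strings are compared character by character, so "1px" ends up last
string_order = sorted(sizes)


def px_value(size):
    """Numeric part of a size such as '12px' (assumes a trailing 'px' unit)."""
    return float(size[:-2])


# Sorting on the numeric part gives the order you would expect
numeric_order = sorted(sizes, key=px_value)

print(string_order)
print(numeric_order)
```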

So, first parse each of the styles into CSS objects. At the same time, keep a list that pairs each parsed style with the span it came from.

Now use the fontSize attribute as the sort key with natsort. With reverse=True this gives a list of styles ordered by font size, largest first. takewhile() then collects entries from the start of that list until it reaches one whose size matches the most common one, which leaves only the entries larger than the most common size.
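
The sort-then-takewhile idea is easier to follow on plain numbers. This is only a sketch with made-up sizes, not the full solution:

```python
from collections import Counter
from itertools import takewhile

# Illustrative font sizes in px, already reduced to integers
sizes = [12, 14, 1, 12, 18, 15, 12]

# The size that occurs most often
most_common = Counter(sizes).most_common(1)[0][0]

# Largest first, so every size above the most common one comes before it
descending = sorted(sizes, reverse=True)

# Stop at the first entry equal to the most common size
larger = list(takewhile(lambda size: size != most_common, descending))

print(most_common, larger)
```

Because the list is sorted largest first, everything takewhile() yields before the first match is necessarily larger than the most common size.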

    from collections import Counter
    from itertools import takewhile

    from bs4 import BeautifulSoup
    import cssutils
    import natsort

    html = """
        <span style="font-family: ArialMT; font-size:12px">1</span>
        <span style="font-family: ArialMT; font-size:14px">2</span>
        <span style="font-family: ArialMT; font-size:1px">3</span>
        <span style="font-family: Arial; font-size:12px">4</span>
        <span style="font-family: ArialMT; font-size:18px">5</span>
        <span style="font-family: ArialMT; font-size:15px">6</span>
        <span style="font-family: ArialMT; font-size:12px">7</span>
        """

    soup = BeautifulSoup(html, "html.parser")
    style_counts = Counter()    # Occurrences of each raw style string
    parsed_css_style = []       # Holds list of tuples (css_style, span)

    for span in soup.find_all('span', style=True):
        style_counts[span['style']] += 1
        parsed_css_style.append((cssutils.parseStyle(span['style']), span))

    # The most common style string, parsed so its font size can be read
    most_common_style = style_counts.most_common(1)[0][0]
    most_common_css_style = cssutils.parseStyle(most_common_style)
    # Sort by font size, largest first
    css_styles = natsort.natsorted(parsed_css_style, key=lambda x: x[0].fontSize, reverse=True)

    print("Styles larger than most common font size of {} are:".format(most_common_css_style.fontSize))

    # Take entries until the most common font size is reached
    for css_style, span in takewhile(lambda x: x[0].fontSize != most_common_css_style.fontSize, css_styles):
        print("  Font size: {:5}  Text: {}".format(css_style.fontSize, span.text))

In the example shown, the most commonly used font size is 12px, so three entries are larger than this:

    Styles larger than most common font size of 12px are:
      Font size: 18px   Text: 5
      Font size: 15px   Text: 6
      Font size: 14px   Text: 2

To install the two libraries you will probably need:

    pip install natsort
    pip install cssutils

Note that this assumes the font sizes used on your website are consistent. It compares only the numerical values and cannot compare sizes given in different units or font metrics.
