Need to extract all the font sizes and the text using beautifulsoup

Submitted by 泪湿孤枕 on 2019-12-25 09:12:54

Question


I have the following html file stored on my local system:

<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:792px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:71px; width:322px; height:38px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:30px">One
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:104px; width:175px; height:40px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Two</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: two txt
<br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Three</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Three txt
<br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Four</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Four txt
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:274px; top:144px; width:56px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">FIVE
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:171px; width:515px; height:44px;"><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">five txt
<br>five txt2 
<br>five txt3
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:220px; top:223px; width:164px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">SIX
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:44px; top:247px; width:489px; height:159px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:17px">six txt
<br></span><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">six txt2
<br>- six txt2
<br>• six txt3
<br>• six txt4 
<br>• six txt5
<br></span>

I need to extract all the font-sizes that occur in this html file. I am using beautifulsoup, but I know only how to extract the text.

I can extract the text using the following code:

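from bs4 import BeautifulSoup

# Read the file and parse it with an explicit parser.
with open('/home/usr/Downloads/files/output.html', 'r') as htmlData:
    soup = BeautifulSoup(htmlData, 'html.parser')

# Collect every text node in the document.
texts = soup.find_all(string=True)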

I need to extract the font size of each piece of text and store the font-text pair into a list or array. To be specific, I want to have a data structure like [('One','30'),('Two','15')] and so on where 30 is from the font-size:30px and 15 from font-size:15px

The only problem is that I can't figure out a way to get the font-size value. Any ideas?


Answer 1:


Hope this helps. I suggest reading more of the BeautifulSoup documentation.

import re
from bs4 import BeautifulSoup

with open('/home/usr/Downloads/files/output.html', 'r') as htmlData:
    soup = BeautifulSoup(htmlData, 'html.parser')

# Keep only the spans whose markup mentions font-size.
font_spans = [data for data in soup.select('span') if 'font-size' in str(data)]
output = []
for i in font_spans:
    # Capture whatever sits between 'font-size:' and 'px' in the style attribute.
    fonts_size = re.search(r'(?is)(font-size:)(.*?)(px)', str(i.get('style'))).group(2)
    tup = (str(i.text).strip(), fonts_size.strip())
    output.append(tup)

print(output)
[('One', '30'),('Two', '15'), ....]

If you want to drop text values that contain txt, you can add the condition if 'txt' not in i.text: inside the loop.
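
A minimal sketch of that filter, assuming a small inline HTML sample in place of the local file. The html string, the shortened font-family values, and the pairs name are illustrative and not part of the original answer:

```python
import re
from bs4 import BeautifulSoup

# Illustrative sample standing in for the local output.html file.
html = (
    '<div><span style="font-family: A; font-size:15px">Two</span>'
    '<span style="font-family: B; font-size:16px">: two txt<br></span></div>'
)
soup = BeautifulSoup(html, "html.parser")

pairs = []
for span in soup.select("span"):
    match = re.search(r"font-size:\s*(\d+)px", span.get("style", ""))
    # Skip spans without a font size and spans whose text contains 'txt'.
    if match and "txt" not in span.text:
        pairs.append((span.text.strip(), match.group(1)))

print(pairs)
```

With this sample only the 'Two' span survives, because the second span's text contains txt.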

Explanation:

First, identify the tags that contain font-size:

font_spans = [ data for data in soup.select('span') if 'font-size' in str(data) ]

Then iterate over font_spans and extract the font size and the text value:

textvalue = i.text # One,Two..
fonts_size = re.search(r'(?is)(font-size:)(.*?)(px)',str(i.get('style'))).group(2) # 30, 15, 16..

Finally, build a list that holds every result as a tuple.
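
The three steps can be wrapped in one function, sketched here under a few assumptions: the extract_font_pairs name, the FONT_SIZE pattern, and the sample string are hypothetical, and the function skips a span when the pattern does not match instead of raising an error.

```python
import re
from bs4 import BeautifulSoup

# Matches e.g. 'font-size:30px' or 'font-size: 12.5px' and captures the number.
FONT_SIZE = re.compile(r"font-size:\s*(\d+(?:\.\d+)?)px")

def extract_font_pairs(html):
    """Return (text, font_size) tuples for every span that declares a font size."""
    soup = BeautifulSoup(html, "html.parser")
    # Step 1: keep only spans whose style attribute mentions font-size.
    font_spans = [s for s in soup.select("span") if "font-size" in s.get("style", "")]
    output = []
    for span in font_spans:
        # Step 2: pull the number out of the style string; skip on no match.
        match = FONT_SIZE.search(span["style"])
        if match is None:
            continue
        # Step 3: store the (text, size) pair.
        output.append((span.text.strip(), match.group(1)))
    return output

# Illustrative sample: the first span has a style but no font-size.
sample = (
    '<span style="left:0px; top:50px;"></span>'
    '<span style="font-family: A; font-size:30px">One<br></span>'
    '<span style="font-family: B; font-size:15px">Two</span>'
)
result = extract_font_pairs(sample)
print(result)
```

Checking the style attribute directly, rather than str(data), avoids matching a span only because font-size appears somewhere in its children.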




Answer 2:


You can use the CSS selector select("[style*=font-size]") to find tags whose style attribute contains font-size, and use a regex to extract the value:

In [12]: from bs4 import BeautifulSoup

In [13]: import re

In [14]: soup = BeautifulSoup(html, "html.parser")

In [15]: patt = re.compile(r"font-size:(\d+)")

In [16]: [(tag.text.strip(), patt.search(tag["style"]).group(1)) for tag in soup.select("[style*=font-size]")]
Out[16]: 
[('One', '30'),
 ('Two', '15'),
 (': two txt', '16'),
 ('Three', '15'),
 (': Three txt', '16'),
 ('Four', '15'),
 (': Four txt', '16'),
 ('FIVE', '19'),
 ('five txt\nfive txt2\nfive txt3', '18'),
 ('SIX', '19'),
 ('six txt', '17'),
 ('six txt2\n- six txt2\n• six txt3\n• six txt4\n• six txt5', '18')]
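
If the goal is the set of font sizes rather than the flat list of pairs, the same selector can feed a grouping step. This is only a sketch: the html sample, the texts_by_size and sizes names, and the conversion to int are assumptions added for illustration.

```python
import re
from collections import defaultdict
from bs4 import BeautifulSoup

# Illustrative sample with two distinct font sizes.
html = (
    '<div><span style="font-family: A; font-size:15px">Two</span>'
    '<span style="font-family: B; font-size:16px">: two txt<br></span>'
    '<span style="font-family: A; font-size:15px">Three</span></div>'
)
soup = BeautifulSoup(html, "html.parser")
patt = re.compile(r"font-size:\s*(\d+)")

# Map each font size (as an int) to the texts rendered at that size.
texts_by_size = defaultdict(list)
for tag in soup.select('[style*="font-size"]'):
    size = int(patt.search(tag["style"]).group(1))
    texts_by_size[size].append(tag.text.strip())

sizes = sorted(texts_by_size)
print(sizes)
print(dict(texts_by_size))
```

Because the selector only returns tags whose style contains font-size, the search is expected to match for this sample; a style such as font-size:1.2em would not match the pattern and would need a guard.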



Answer 3:


You should do some research yourself; the Beautiful Soup documentation and the regex documentation are worth reading to understand how things fit together.

Check out the following example, which uses a regular expression to find the first occurrence of font-size and then splits the remainder to get only the pixel number.

from bs4 import BeautifulSoup as Soup
from bs4 import Tag
import re

data = """
  <span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:792px;"></span>
  <div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
  <div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:71px; width:322px; height:38px;">
    <span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:30px">One
    <br></span>
  </div>
  <div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:104px; width:175px; height:40px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Two</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: two txt
  <br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Three</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Three txt
  <br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Four</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Four txt
  <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:274px; top:144px; width:56px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">FIVE
  <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:171px; width:515px; height:44px;"><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">five txt
  <br>five txt2 
  <br>five txt3
  <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:220px; top:223px; width:164px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">SIX
  <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:44px; top:247px; width:489px; height:159px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:17px">six txt
  <br></span><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">six txt2
  <br>- six txt2
  <br>• six txt3
  <br>• six txt4 
  <br>• six txt5
  <br></span>
"""
soup = Soup(data, 'html.parser')

def get_the_start_of_font(attr):
  """ Return the index of the 'font-size' first occurrence or None. """
  match = re.search(r'font-size:', attr)
  if match is not None:
    return match.start()
  return None 

def get_font_size_from(attr):
  """ Return the font size as string or None if not found. """
  font_start_i = get_the_start_of_font(attr)
  if font_start_i is not None:
    return str(attr[font_start_i + len('font-size:'):].split('px')[0])
  return None

# iterate through all descendants:
fonts = []
for child in soup.descendants:
  if isinstance(child, Tag) and child.get('style') is not None:
    font = get_font_size_from(child.get('style'))
    if font is not None:
      fonts.append([
        str(child.text).strip(), font])

print(fonts)

The solution can be improved, but this is a working example.
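
One possible improvement, sketched under stated assumptions, is to split the style attribute into its declarations instead of searching for a substring. The parse_style helper and the sample string are hypothetical names introduced here, and the splitting is naive (it does not handle semicolons inside quoted values).

```python
from bs4 import BeautifulSoup

def parse_style(style):
    """Split an inline style string into a {property: value} dict."""
    declarations = {}
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            declarations[name.strip().lower()] = value.strip()
    return declarations

# Illustrative sample: one tag without font-size, one with it.
sample = (
    '<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>'
    '<span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:30px">One<br></span>'
)
soup = BeautifulSoup(sample, "html.parser")

fonts = []
# find_all(style=True) returns only tags that carry a style attribute.
for tag in soup.find_all(style=True):
    size = parse_style(tag["style"]).get("font-size")
    if size is not None:
        # Drop the trailing 'px' unit when present.
        number = size[:-2] if size.endswith("px") else size
        fonts.append((tag.text.strip(), number))

print(fonts)
```

Looking up the property by name means a declaration such as x-font-size would not be mistaken for font-size, which a plain substring search cannot guarantee.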



Source: https://stackoverflow.com/questions/39012739/need-to-extract-all-the-font-sizes-and-the-text-using-beautifulsoup
