How do I write a BeautifulSoup strainer that only parses objects with certain text between the tags?

北城余情 submitted on 2021-02-08 13:24:10

Question


I'm using Django and Python 3.7. I want more efficient parsing, so I was reading about SoupStrainer objects. I created a custom one to help me parse only the elements I need:

def my_custom_strainer(self, elem, attrs):
    for attr in attrs:
        print("attr:" + attr + "=" + attrs[attr])
    if elem == 'div' and 'class' in attrs and attrs['class'] == "score":
        return True
    elif elem == "span" and elem.text == re.compile("my text"):
        return True

article_stat_page_strainer = SoupStrainer(self.my_custom_strainer)
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)

One of the conditions is I only want to parse "span" elements whose text matches a certain pattern. Hence the

elem == "span" and elem.text == re.compile("my text")

clause. However, this results in an

AttributeError: 'str' object has no attribute 'text'

error when I try to run the above. What's the proper way to write my strainer?


Answer 1:


TL;DR: No, this is currently not easily possible in BeautifulSoup (the BeautifulSoup and SoupStrainer objects would need to be modified).

Explanation:

The problem is that the function passed to the strainer is called from the handle_starttag() method. At that point only the values from the opening tag are available (i.e. the element name and attrs), which is why elem is a plain str with no .text attribute.

https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/__init__.py#L524

if (self.parse_only and len(self.tagStack) <= 1
    and (self.parse_only.text
         or not self.parse_only.search_tag(name, attrs))):
    return None

As you can see, if your strainer function returns False, the element is discarded immediately, before its inner text has been parsed, so the text can never be taken into consideration (unfortunately).

On the other hand, you can add "text" to the search:

SoupStrainer(text="my text")

This makes the strainer search the text inside the tags, but a text match has no access to the element or its attributes - you can see the irony :/

Combining the two just finds nothing: as the condition above shows, every top-level tag is rejected as soon as parse_only.text is set. And you can't even access the parent, as is done in the find function shown here: https://gist.github.com/RichardBronosky/4060082

So currently strainers are only good for filtering on element names and attrs. You would need to change a lot of Beautiful Soup code to filter on inner text as well.

If you really need this, I suggest subclassing BeautifulSoup and SoupStrainer and modifying their behavior.
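
A workaround that stays within the stock API is to split the work in two: let the strainer filter on tag names only, then apply the text condition to the much smaller tree that results. The following is only a minimal sketch of that idea, not a change to the strainer itself. The sample HTML and the variable names are made up for illustration; the pattern "my text" and the "score" class come from the question.

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample document, made up for illustration.
html = """
<html><body>
  <p>intro</p>
  <div class="score">42</div>
  <div class="other">ignored</div>
  <span>some my text here</span>
  <span>something else</span>
</body></html>
"""

# Stage 1: the strainer filters on tag names only, which is all it can see
# while the opening tag is being handled.
only_divs_and_spans = SoupStrainer(["div", "span"])
soup = BeautifulSoup(html, features="html.parser", parse_only=only_divs_and_spans)

# Stage 2: on the reduced tree every element is a Tag, so its text is available.
pattern = re.compile("my text")
score_divs = soup.find_all("div", class_="score")
matching_spans = [s for s in soup.find_all("span") if pattern.search(s.get_text())]

print([d.get_text() for d in score_divs])
print([s.get_text() for s in matching_spans])
```

This still builds every div and span before the text check runs, so it saves less work than a text-aware strainer would, but it needs no changes to Beautiful Soup.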




Answer 2:


It seems you are trying to loop over the soup elements inside my_custom_strainer.

To do that, parse the document first (straining on tag names only) and then pass the resulting soup to the function:

import re
from bs4 import BeautifulSoup, SoupStrainer

article_stat_page_strainer = SoupStrainer(['div', 'span'])
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)
my_custom_strainer(soup)

Then slightly modify my_custom_strainer to something like:

def my_custom_strainer(soup):
    pattern = re.compile("my text")
    for d in soup.find_all(['div', 'span']):
        # d is a Tag here, so both its attributes and its text are available
        if d.name == 'div' and 'score' in d.get('class', []):
            return d.text  # meet your needs here
        elif d.name == 'span' and pattern.search(d.text):
            return d.text  # meet your needs here

This way you can access the soup objects iteratively.
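
A side note on the text check itself: the question's clause compares elem.text to re.compile("my text") with ==, and a compiled pattern never compares equal to a string, so that test could not succeed even on a real Tag. The sketch below shows the difference using only the standard library; text_matches is a hypothetical helper name.

```python
import re

pattern = re.compile("my text")


def text_matches(text):
    """Return True when the pattern occurs anywhere in text (hypothetical helper)."""
    return pattern.search(text) is not None


# Comparing a string to a compiled pattern with == is always False...
print("my text" == pattern)
# ...whereas search() actually applies the regular expression.
print(text_matches("this is my text here"))
print(text_matches("unrelated"))
```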




Answer 3:


I recently created an lxml / BeautifulSoup parser for HTML files, which also searches between specific tags.

The function I wrote opens a file dialog and lets you select the specific HTML file to parse.

def openFile(self):
    # Assumes QFileDialog (e.g. from PyQt5.QtWidgets) and BeautifulSoup (from bs4) are already imported.
    options = QFileDialog.Options()
    options |= QFileDialog.DontUseNativeDialog
    fileName, _ = QFileDialog.getOpenFileName(self, "QFileDialog.getOpenFileName()", "",
                                              "All Files (*);;Python Files (*.py)", options=options)
    if fileName:
        # Read the selected file and close it again right away.
        with open(fileName) as file:
            data = file.read()
        soup = BeautifulSoup(data, "lxml")
        # Collect the numeric text of every <strong> tag.
        results = [float(item.text) for item in soup.find_all('strong')]
        print('Score =', results[1])
        print('Fps =', results[0])

You can see that the tag I specified was 'strong', and I was trying to find the text within that tag.

Hope this helps in some way.



Source: https://stackoverflow.com/questions/54838079/how-do-i-write-a-beautifulsoup-strainer-that-only-parses-objects-with-certain-te
