Question
I'm using Django and Python 3.7. I want to have more efficient parsing so I was reading about SoupStrainer objects. I created a custom one to help me parse only the elements I need ...
def my_custom_strainer(self, elem, attrs):
    for attr in attrs:
        print("attr:" + attr + "=" + attrs[attr])
    if elem == 'div' and 'class' in attr and attrs['class'] == "score":
        return True
    elif elem == "span" and elem.text == re.compile("my text"):
        return True

article_stat_page_strainer = SoupStrainer(self.my_custom_strainer)
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)
One of the conditions is I only want to parse "span" elements whose text matches a certain pattern. Hence the
elem == "span" and elem.text == re.compile("my text")
clause. However, this results in an
AttributeError: 'str' object has no attribute 'text'
error when I try and run the above. What's the proper way to write my strainer?
Answer 1:
TL;DR: No, this is currently not easily possible in BeautifulSoup (the BeautifulSoup and SoupStrainer objects would need to be modified).
Explanation:
The problem is that the function passed to the strainer gets called from the handle_starttag() method. As you can guess, at that point you only have the values from the opening tag (i.e. the element name and attrs).
https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/init.py#L524
if (self.parse_only and len(self.tagStack) <= 1
    and (self.parse_only.text
         or not self.parse_only.search_tag(name, attrs))):
    return None
And as you can see, if your strainer function returns False, the element gets discarded immediately, without any chance to take its inner text into consideration (unfortunately).
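To illustrate why the text is not available at that point, here is a minimal sketch using the standard library's html.parser, which Beautiful Soup's "html.parser" builder is built on. This is not Beautiful Soup's own code, and EventLogger is a hypothetical name used only for this example. It just records the order in which the parser reports things: the start tag (name and attributes) arrives first, and the text only shows up later as a separate event.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records parser events in the order they arrive."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Only the tag name and its attributes are known here;
        # the text inside the element has not been seen yet.
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        # The inner text is reported afterwards, as its own event.
        if data.strip():
            self.events.append(("data", data))


logger = EventLogger()
logger.feed('<span class="note">my text</span>')
logger.close()

for event in logger.events:
    print(event)
```

A keep-or-discard decision made in the start-tag handler therefore cannot depend on the text that follows.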
On the other hand, you can add "text" to the search:
SoupStrainer(text="my text")
It will then search inside the tag for text, but that search has no context of the element or its attributes. You can see the irony :/
Combining the two will just find nothing. And you can't even access the parent, as shown here for the find function: https://gist.github.com/RichardBronosky/4060082
So currently strainers are only good for filtering on elements/attrs. You would need to change a lot of Beautiful Soup code to get text matching working.
If you really need this, I suggest inheriting BeautifulSoup and SoupStrainer objects and modifying their behavior.
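If subclassing is more than you need, a lighter alternative (not what is suggested above, just a sketch of a different approach) is to let the strainer filter on tag names only and apply the text condition after parsing. The html sample below is made up for illustration, and the "score" class and "my text" pattern are taken from the question.

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Made-up sample document, for illustration only.
html = """
<div class="score">42</div>
<p>skipped</p>
<span>my text here</span>
<span>something else</span>
"""

# Step 1: the strainer only restricts which tags get parsed at all.
strainer = SoupStrainer(["div", "span"])
soup = BeautifulSoup(html, "html.parser", parse_only=strainer)

# Step 2: the text condition is checked once the elements exist.
pattern = re.compile("my text")
matching_spans = [s for s in soup.find_all("span") if pattern.search(s.get_text())]
scores = soup.find_all("div", class_="score")

print([s.get_text() for s in matching_spans])
print([d.get_text() for d in scores])
```

This still parses every span, including the ones you end up throwing away, so it only saves the work of parsing the tags the strainer rejects.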
Answer 2:
It seems you are trying to loop over the soup elements inside the my_custom_strainer method.
In order to do so, you could do it as follows:
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)
my_custom_strainer(soup, attrs)
Then slightly modify my_custom_strainer so it looks something like this:
def my_custom_strainer(soup, attrs):
    for attr in attrs:
        print("attr:" + attr + "=" + attrs[attr])
    # requires "import re" at the top of the module
    pattern = re.compile("my text")
    for d in soup.find_all(['div', 'span']):
        # on a parsed tag, "class" is a list of class names
        if d.name == 'div' and 'score' in d.get('class', []):
            return d.text  # meet your needs here
        # search the text with the pattern instead of comparing with ==
        elif d.name == 'span' and pattern.search(d.text):
            return d.text  # meet your needs here
This way you can access the soup objects iteratively.
Answer 3:
I recently created a lxml / BeautifulSoup parser for html files, which also searches between specific tags.
The function I wrote opens a file dialog and lets you select the specific HTML file to parse.
def openFile(self):
    options = QFileDialog.Options()
    options |= QFileDialog.DontUseNativeDialog
    fileName, _ = QFileDialog.getOpenFileName(self, "QFileDialog.getOpenFileName()", "",
                                              "All Files (*);;Python Files (*.py)", options=options)
    if fileName:
        # read the selected file and close it afterwards
        with open(fileName) as file:
            data = file.read()
        soup = BeautifulSoup(data, "lxml")
        results = []  # not defined in the original snippet
        # collect the numeric text of every <strong> tag
        for item in soup.find_all('strong'):
            results.append(float(item.text))
        print('Score =', results[1])
        print('Fps =', results[0])
You can see that the tag I specified was 'strong', and I was trying to find the text within that tag.
Hope this helps in some way.
Source: https://stackoverflow.com/questions/54838079/how-do-i-write-a-beautifulsoup-strainer-that-only-parses-objects-with-certain-te