How to do a partial conditioning on a tag for find_all() in bs4?

Question

I have an XML document containing multiple tags that look like this:

<textblock height="55" hpos="143" id="Page1_Block5" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">

I want to get all the <textblock> tags grouped by page. The page is encoded in the id attribute of each tag, which is written like this: id="Page1_Block5".

I want to filter only on the page number, not the block number (that is, I want all blocks of a specific page).

This is what I have so far:

import bs4 as bs

xml_soup = bs.BeautifulSoup(table, 'lxml')

text_blocks = xml_soup.find_all('textblock')

What additional parameters do I need to pass to find_all() so that the results are filtered only on the Page{} part of the id?

Answer 1:

This should help you:

text_blocks = xml_soup.find_all('textblock', id = lambda value: value and value.startswith("Page1"))

Here is my complete code:

from bs4 import BeautifulSoup

xml = """
<textblock height="55" hpos="143" id="Page1_Block5" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">
"""

xml_soup = BeautifulSoup(xml,'lxml')

text_blocks = xml_soup.find_all('textblock', id = lambda value: value and value.startswith("Page1"))

Explanation:

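find_all() calls the lambda with the id value of each textblock tag and keeps the tags for which it returns a true value, here those whose id starts with Page1. The "value and" guard covers tags that have no id at all, for which the value is None. I have also added a few more tags to the xml variable. Here is the test data that I used: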

xml = """
<textblock height="55" hpos="143" id="Page1_Block5" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">
<textblock height="55" hpos="143" id="Page1_Block4" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">
<textblock height="55" hpos="143" id="Page2_Block5" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">
<textblock height="55" hpos="143" id="Page1_Block1" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">
"""

As you can see, there are 3 textblock tags with an id that starts with Page1. When I ran my code on this test data and printed the length of text_blocks, this is the output I got:

>>> len(text_blocks)
3

This shows that the code works! Hope that this helps!
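One caveat worth keeping in mind: the startswith("Page1") check above would also accept an id such as Page10_Block1 if the document has ten or more pages. Below is a minimal sketch of a stricter variant that includes the underscore in the prefix. The helper name blocks_for_page, the trimmed-down sample data, and the use of the built-in 'html.parser' (chosen only so the snippet runs without lxml installed) are my own assumptions, not part of the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data: note the extra Page10 block and a block without an id.
xml = """
<root>
<textblock id="Page1_Block5"></textblock>
<textblock id="Page1_Block4"></textblock>
<textblock id="Page2_Block5"></textblock>
<textblock id="Page10_Block1"></textblock>
<textblock></textblock>
</root>
"""

soup = BeautifulSoup(xml, "html.parser")


def blocks_for_page(soup, page):
    """Return the textblock tags whose id belongs to the given page number."""
    # The trailing underscore keeps Page1 from also matching Page10, Page11, ...
    prefix = f"Page{page}_"
    # The filter receives each tag's id value, or None when the tag has no id.
    return soup.find_all(
        "textblock",
        id=lambda value: value is not None and value.startswith(prefix),
    )


# Loose prefix, as in the answer above: also picks up Page10_Block1.
loose = soup.find_all("textblock", id=lambda v: v and v.startswith("Page1"))
# Strict prefix: only the blocks of page 1.
strict = blocks_for_page(soup, 1)

print(len(loose), len(strict))  # 3 2
```

If your documents never go past nine pages, the shorter prefix behaves the same; the underscore simply makes the match unambiguous.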

P.S.: You can refer to this link for more details about extracting elements whose id starts with a particular string.
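Since the question asks for the blocks clustered by page, here is a rough sketch of how all pages could be collected in one pass instead of calling find_all() once per page. It passes a compiled regular expression as the id filter, which find_all() also accepts. The pattern, the blocks_by_page name, and the sample data are assumptions for illustration, and it presumes every id follows the Page<number>_Block<number> scheme shown in the question.

```python
import re
from collections import defaultdict

from bs4 import BeautifulSoup

# Hypothetical sample data following the id scheme from the question.
xml = """
<root>
<textblock id="Page1_Block5"></textblock>
<textblock id="Page1_Block4"></textblock>
<textblock id="Page2_Block5"></textblock>
<textblock id="Page10_Block1"></textblock>
<textblock id="Page1_Block1"></textblock>
</root>
"""

soup = BeautifulSoup(xml, "html.parser")

# Captures the page number from ids such as "Page1_Block5".
page_pattern = re.compile(r"^Page(\d+)_")

blocks_by_page = defaultdict(list)
# A compiled regex as the id filter keeps only tags whose id matches it.
for tag in soup.find_all("textblock", id=page_pattern):
    page = int(page_pattern.match(tag["id"]).group(1))
    blocks_by_page[page].append(tag)

for page, tags in sorted(blocks_by_page.items()):
    print(page, [t["id"] for t in tags])
```

After this runs, blocks_by_page[1] holds the three Page1 tags in document order, and tags without a matching id are skipped.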

Source: https://stackoverflow.com/questions/64279067/how-to-do-a-partial-conditioning-on-a-tag-for-find-all-in-bs4

Tags

python-3.x

beautifulsoup

xml-parsing