Python regular expression for HTML parsing (BeautifulSoup)

感情败类 2020-11-27 19:21

I want to grab the value of a hidden input field in HTML.

7 answers

失恋的感觉 (original poster)

2020-11-27 19:54

Pyparsing is a good middle ground between BeautifulSoup and regular expressions. It is more robust than plain regexes, because its HTML tag parsing handles variations in case, whitespace, and attribute presence, absence, and order, yet it is simpler than BeautifulSoup for this kind of basic tag extraction.
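To make the robustness point concrete, here is a minimal sketch using only the standard `re` module. The sample tags and the naive pattern are made up for illustration (they are not from the question); the point is only that a pattern which hard-codes case, spacing, and attribute order misses equivalent spellings of the same tag.

```python
import re

# Hypothetical sample tags: the same kind of hidden field, written four ways.
samples = [
    '<input type="hidden" name="fooId" value="A" />',
    '<input name="fooId" type="hidden" value="B" />',    # attribute order swapped
    '<INPUT NAME="fooId" type="hidden" value="C" />',    # upper-case tag and attribute name
    '<input  type = "hidden"  name="fooId" value="D">',  # extra whitespace, no closing slash
]

# A naive pattern that assumes one exact spelling of the tag.
naive = re.compile(r'<input type="hidden" name="fooId" value="([^"]*)"')

matched = []
for sample in samples:
    m = naive.search(sample)
    if m:
        matched.append(m.group(1))

# Only the first spelling is found; the other three are silently skipped.
print(matched)
```

Each missed variation could be patched with a more elaborate pattern, but the pattern grows quickly, which is the problem a tag-aware parser avoids.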

Your example is especially simple, since everything you are looking for is in the attributes of the opening "input" tag. Here is a pyparsing example showing several variations on your input tag that would give regexes fits, and also shows how NOT to match a tag if it is within a comment:

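# sample input, reconstructed from the printed results below
# (the original markup was stripped when this page was extracted)
html = """<html><body>
<input type="hidden" name="fooId" value="**[id is here]**" />
<blah>
<input name="fooId" type="hidden" value="**[id is here too]**" />
<input NAME="fooId" type="hidden" value="**[id is HERE too]**" />
<INPUT NAME="fooId" type="hidden" value="**[and id is even here TOO]**" />
<!--
<input type="hidden" name="fooId" value="**[don't report this id]**" />
-->
<foo>
</body></html>"""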

from pyparsing import makeHTMLTags, withAttribute, htmlComment

# use makeHTMLTags to create tag expression - makeHTMLTags returns expressions for
# opening and closing tags, we're only interested in the opening tag
inputTag = makeHTMLTags("input")[0]

# only want input tags with special attributes
inputTag.setParseAction(withAttribute(type="hidden", name="fooId"))

# don't report tags that are commented out
inputTag.ignore(htmlComment)

# use searchString to skip through the input 
foundTags = inputTag.searchString(html)

# dump out first result to show all returned tags and attributes
print(foundTags[0].dump())
print()

# print out the value attribute for all matched tags
for inpTag in foundTags:
    print(inpTag.value)

Prints:

['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
- empty: True
- name: fooId
- startInput: ['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
  - empty: True
  - name: fooId
  - type: hidden
  - value: **[id is here]**
- type: hidden
- value: **[id is here]**

**[id is here]**
**[id is here too]**
**[id is HERE too]**
**[and id is even here TOO]**

You can see that not only does pyparsing match these unpredictable variations, it returns the data in an object that makes it easy to read out the individual tag attributes and their values.
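For readers who cannot install pyparsing, the same extraction can be cross-checked with the standard library's `html.parser`. This is a swapped-in technique, not the pyparsing approach above: a minimal sketch in which the class name `HiddenInputCollector` and the sample markup are hypothetical. `HTMLParser` lower-cases tag and attribute names and routes comments to a separate handler, so it copes with the same case, order, and comment variations.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect the value of every <input type="hidden" name="fooId"> tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lower-cased; attribute values keep their case.
        # Self-closing tags (<input ... />) also reach this method by default.
        attributes = dict(attrs)
        if (tag == "input"
                and attributes.get("type") == "hidden"
                and attributes.get("name") == "fooId"):
            self.values.append(attributes.get("value"))


# Hypothetical sample markup, including one tag inside a comment.
sample_html = """<html><body>
<input type="hidden" name="fooId" value="first" />
<input name="fooId" type="hidden" value="second" />
<INPUT NAME="fooId" type="hidden" value="third" />
<!--
<input type="hidden" name="fooId" value="commented out" />
-->
</body></html>"""

collector = HiddenInputCollector()
collector.feed(sample_html)
collector.close()
print(collector.values)
```

The commented-out tag is not reported because comment content goes to `handle_comment`, which this sketch leaves at its default no-op.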
