Python regular expression for HTML parsing (BeautifulSoup)

感情败类 2020-11-27 19:21

I want to grab the value of a hidden input field in HTML.



7 answers
  •  失恋的感觉
    2020-11-27 19:54

    Pyparsing is a good interim step between BeautifulSoup and regex. It is more robust than plain regexes, since its HTML tag parsing handles variations in case, whitespace, and attribute presence, absence, and order, yet it is simpler than BeautifulSoup for this kind of basic tag extraction.
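To make the fragility of regexes concrete, here is a minimal sketch using only the standard `re` module. The pattern `NAIVE`, the helper `naive_extract`, and the sample tags with the value `abc123` are all hypothetical and chosen for illustration; they are not from the original question.

```python
import re

# Hypothetical naive pattern: assumes lowercase names, double quotes,
# single spaces, and the attribute order type, name, value.
NAIVE = re.compile(r'<input type="hidden" name="fooId" value="([^"]*)"')

# Illustrative variants of the same hidden field (not taken from the original post).
variants = [
    '<input type="hidden" name="fooId" value="abc123" />',     # the form the pattern expects
    '<input name="fooId" type="hidden" value="abc123" />',     # attributes reordered
    '<INPUT TYPE="hidden" NAME="fooId" VALUE="abc123" />',     # upper-case tag and attribute names
    '<input  type = "hidden"  name="fooId" value="abc123"/>',  # extra whitespace
]


def naive_extract(tag):
    """Return the captured value, or None when the naive pattern does not match."""
    m = NAIVE.search(tag)
    return m.group(1) if m else None


results = [naive_extract(v) for v in variants]
for variant, value in zip(variants, results):
    print(f"{value!r:10} <- {variant}")
```

Only the first variant matches. A browser treats all four as the same field, which is why a tag-aware parser is the safer choice here.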

    Your example is especially simple, since everything you are looking for is in the attributes of the opening "input" tag. Here is a pyparsing example that handles several variations on your input tag that would give regexes fits, and that also shows how NOT to match a tag if it is inside a comment:

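    # NOTE: the sample markup was lost from this copy of the post; the string below is
    # a reconstruction consistent with the output shown under "Prints:".
    html = """<html><body>
    <input type="hidden" name="fooId" value="**[id is here]**" />
    <input name="fooId" type="hidden" value="**[id is here too]**" />
    <input NAME="fooId" type="hidden" value="**[id is HERE too]**" />
    <INPUT NAME="fooId" type="hidden" value="**[and id is even here TOO]**" />
    <!--
    <input type="hidden" name="fooId" value="**[don't report this id]**" />
    -->
    </body></html>"""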
    
    from pyparsing import makeHTMLTags, withAttribute, htmlComment
    
    # use makeHTMLTags to create tag expression - makeHTMLTags returns expressions for
    # opening and closing tags, we're only interested in the opening tag
    inputTag = makeHTMLTags("input")[0]
    
    # only want input tags with special attributes
    inputTag.setParseAction(withAttribute(type="hidden", name="fooId"))
    
    # don't report tags that are commented out
    inputTag.ignore(htmlComment)
    
    # use searchString to skip through the input 
    foundTags = inputTag.searchString(html)
    
    # dump out first result to show all returned tags and attributes
    print(foundTags[0].dump())
    print()
    
    # print out the value attribute for all matched tags
    for inpTag in foundTags:
        print(inpTag.value)
    

    Prints:

    ['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
    - empty: True
    - name: fooId
    - startInput: ['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
      - empty: True
      - name: fooId
      - type: hidden
      - value: **[id is here]**
    - type: hidden
    - value: **[id is here]**
    
    **[id is here]**
    **[id is here too]**
    **[id is HERE too]**
    **[and id is even here TOO]**
    

    You can see that not only does pyparsing match these unpredictable variations, it returns the data in an object that makes it easy to read out the individual tag attributes and their values.
