How to use Beautiful Soup to extract a string from a <script> tag?

梦如初夏 2020-11-28 14:45

In a given .html page, I have a script tag like so:

     

        
4 Answers
  • 2020-11-28 14:50

    I ran into a similar problem and the issue seems to be that calling script_tag.text returns an empty string. Instead, you have to call script_tag.string. Maybe this changed in some version of BeautifulSoup?

    Anyway, @alecxe's answer didn't work for me, so I modified their solution:

    import re
    
    from bs4 import BeautifulSoup
    
    data = """
    <body>
        <script>jQuery(window).load(function () {
          setTimeout(function(){
            jQuery("input[name=Email]").val("name@email.com");
          }, 1000);
        });</script>
    </body>
    """
    soup = BeautifulSoup(data, "html.parser")
    
    script_tag = soup.find("script")
    if script_tag:
      # contains all of the script tag, e.g. "jQuery(window)..."
      script_tag_contents = script_tag.string
    
      # from there you can search the string using a regex, etc.
      email = re.search(r'\.val\("(.+?)"\);', script_tag_contents).group(1)
      print(email)
    

    This prints name@email.com.
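
    A slightly more defensive variant of the same idea, as a sketch: script_tag.string is None when the tag is empty or has more than one child, and re.search returns None when nothing matches, so both are checked before use. The helper name extract_val_argument and the assumption that the value is set through .val("...") are illustrative, not part of the answer above.

```python
import re

from bs4 import BeautifulSoup

# Assumption: the value of interest is passed as a double-quoted string to .val(...)
VAL_PATTERN = re.compile(r'\.val\("([^"]+)"\)')


def extract_val_argument(html):
    """Return the first .val("...") argument in the first <script> tag, or None."""
    soup = BeautifulSoup(html, "html.parser")
    script_tag = soup.find("script")
    if script_tag is None or script_tag.string is None:
        # No <script> tag, or one without a single text child (e.g. src-only)
        return None
    match = VAL_PATTERN.search(script_tag.string)
    return match.group(1) if match else None


data = """
<body>
    <script>jQuery(window).load(function () {
      setTimeout(function(){
        jQuery("input[name=Email]").val("name@email.com");
      }, 1000);
    });</script>
</body>
"""
print(extract_val_argument(data))
print(extract_val_argument("<p>no script here</p>"))
```

    Returning None instead of raising lets the caller decide how to handle pages where the script is missing or laid out differently.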

  • 2020-11-28 14:51

    This is not possible using only BeautifulSoup, but you can do it with, for example, BeautifulSoup plus regular expressions:

    import re
    from bs4 import BeautifulSoup as BS
    
    html = """<script> ... </script>"""
    
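    # Name the parser explicitly to avoid the "no parser specified" warning
    bs = BS(html, "html.parser")
    
    # .string returns the raw contents of the <script> tag
    txt = bs.script.string
    
    # re.search scans the whole (multi-line) script; re.match only tries the start
    email = re.search(r'\.val\("(.+?)"\);', txt).group(1)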
    

    Or, without a regex:

    ...
    
    email = txt.split('.val("')[1].split('");')[0]
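
    The split version raises IndexError when the marker is missing. A sketch of the same idea with str.partition, which returns None instead; the helper name text_between is hypothetical:

```python
def text_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None  # start marker not present
    value, found_end, _ = rest.partition(end)
    return value if found_end else None


txt = 'jQuery("input[name=Email]").val("name@email.com");'
print(text_between(txt, '.val("', '");'))
print(text_between("no marker here", '.val("', '");'))
```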
    
  • 2020-11-28 14:56

    You could solve this with just a couple of lines of gazpacho and .split, no regex required!

    from gazpacho import Soup
    
    html = """\
    <script>jQuery(window).load(function () {
      setTimeout(function(){
        jQuery("input[name=Email]").val("name@email.com");
      }, 1000);
    });</script>
    """
    
    soup = Soup(html)
    string = soup.find("script").text
    string.split(".val(\"")[-1].split("\");")[0]
    

    Which would output:

    'name@email.com'
    
  • 2020-11-28 15:05

    To add a bit more to @Bob's answer: let's assume you also need to locate the right script tag in HTML that may contain other script tags.

    The idea is to define a regular expression that would be used for both locating the element with BeautifulSoup and extracting the email value:

    import re
    
    from bs4 import BeautifulSoup
    
    
    data = """
    <body>
        <script>jQuery(window).load(function () {
          setTimeout(function(){
            jQuery("input[name=Email]").val("name@email.com");
          }, 1000);
        });</script>
    </body>
    """
    pattern = re.compile(r'\.val\("([^@]+@[^@]+\.[^@]+)"\);', re.MULTILINE | re.DOTALL)
    soup = BeautifulSoup(data, "html.parser")
    
    script = soup.find("script", string=pattern)
    if script:
        match = pattern.search(script.string)
        if match:
            email = match.group(1)
            print(email)
    

    Prints: name@email.com.
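
    To illustrate the "other script tags" case, here is a sketch with three script tags where only the last one matches. It passes the pattern as string=, the current name of the argument that older Beautiful Soup versions called text=. The sample HTML and the file name app.js are made up for the example.

```python
import re

from bs4 import BeautifulSoup

# Made-up page: an external script, an unrelated inline script, and the target
html = """
<body>
    <script src="app.js"></script>
    <script>console.log("nothing to see here");</script>
    <script>jQuery(window).load(function () {
      jQuery("input[name=Email]").val("name@email.com");
    });</script>
</body>
"""
pattern = re.compile(r'\.val\("([^@]+@[^@]+\.[^@]+)"\);')
soup = BeautifulSoup(html, "html.parser")

# Only a <script> whose single string child matches the pattern is returned
script = soup.find("script", string=pattern)
email = None
if script is not None:
    match = pattern.search(script.string)
    if match:
        email = match.group(1)
print(email)
```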

    Here we are using a simple regular expression for the email address. We could be stricter about it, but I doubt that is necessary in practice for this problem.
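
    As a sketch of what "stricter" could look like: the pattern below limits the local part and the domain to common address characters. The character classes are an assumption about which addresses should be accepted, not a full address validator.

```python
import re

# Pattern from the answer: anything without "@" counts as a local part or domain
loose = re.compile(r'\.val\("([^@]+@[^@]+\.[^@]+)"\);')
# Hypothetical stricter pattern: letters, digits and a few punctuation characters
strict = re.compile(r'\.val\("([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"\);')

good = 'jQuery("input[name=Email]").val("name@email.com");'
odd = 'jQuery("input[name=Email]").val("not an email@x y.z");'

for name, regex in (("loose", loose), ("strict", strict)):
    for script_text in (good, odd):
        match = regex.search(script_text)
        print(name, "->", match.group(1) if match else None)
```

    Both patterns accept the address from the question; only the loose one accepts the value containing spaces.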
