In a given .html page, I have a script tag like so:
I ran into a similar problem, and the issue seems to be that script_tag.text returns an empty string. Instead, you have to use script_tag.string. Maybe this changed in some version of BeautifulSoup?
Anyway, @alecxe's answer didn't work for me, so I modified their solution:
import re
from bs4 import BeautifulSoup
data = """
<body>
<script>jQuery(window).load(function () {
setTimeout(function(){
jQuery("input[name=Email]").val("name@email.com");
}, 1000);
});</script>
</body>
"""
soup = BeautifulSoup(data, "html.parser")
script_tag = soup.find("script")
if script_tag:
    # holds the full contents of the script tag, e.g. "jQuery(window)..."
    script_tag_contents = script_tag.string
    # from there you can search the string using a regex, etc.
    email = re.search(r'\.val\("(.+)"\);', script_tag_contents).group(1)
    print(email)
This prints name@email.com.
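The one-liner above raises AttributeError when re.search finds nothing, and script_tag.string is None for a tag that does not have exactly one string child (for example an empty script tag with only a src attribute). Below is a minimal sketch of a more defensive variant, assuming the script contents have already been collected, e.g. with [tag.string for tag in soup.find_all("script")]. The name find_val_arguments is a hypothetical helper, not part of BeautifulSoup:

```python
import re
from typing import Iterable, List, Optional

# Matches .val("...") and captures the quoted argument (non-greedy).
VAL_CALL = re.compile(r'\.val\("(.+?)"\)')


def find_val_arguments(script_texts: Iterable[Optional[str]]) -> List[str]:
    """Collect every .val("...") argument from the given script contents."""
    found = []
    for text in script_texts:
        if not text:  # skip None (no single string child) and empty scripts
            continue
        found.extend(VAL_CALL.findall(text))
    return found


# Stand-ins for the .string values of three script tags.
scripts = [
    None,
    "console.log('unrelated');",
    'jQuery("input[name=Email]").val("name@email.com");',
]
print(find_val_arguments(scripts))
```

Returning a list instead of calling .group(1) directly means a page without a matching script yields an empty list rather than an exception.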
This is not possible using only BeautifulSoup, but you can do it with, for example, BeautifulSoup plus regular expressions:
import re
from bs4 import BeautifulSoup as BS
html = """<script> ... </script>"""
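bs = BS(html, "html.parser")
txt = bs.script.get_text()
# use re.search, not re.match: re.match only matches at the start of the
# string, and "." does not cross newlines in a multi-line script
email = re.search(r'\.val\("(.+?)"\);', txt).group(1)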
Or like this:
...
email = txt.split('.val("')[1].split('");')[0]
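The split version raises IndexError if the script contains no .val(" call. Here is a small sketch of the same idea using str.partition, which never raises. The name between is a hypothetical helper introduced only for illustration:

```python
from typing import Optional


def between(text: str, start: str, end: str) -> Optional[str]:
    """Return the text between the first `start` and the next `end`, or None."""
    _, found_start, rest = text.partition(start)
    if not found_start:  # start marker is missing
        return None
    value, found_end, _ = rest.partition(end)
    return value if found_end else None  # None if the end marker is missing


txt = 'jQuery("input[name=Email]").val("name@email.com");'
print(between(txt, '.val("', '");'))
print(between("no call here", '.val("', '");'))
```

The first call prints name@email.com and the second prints None, so the caller can test the result instead of catching an exception.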
You could solve this with just a couple of lines of gazpacho and .split, no regex required!
from gazpacho import Soup
html = """\
<script>jQuery(window).load(function () {
setTimeout(function(){
jQuery("input[name=Email]").val("name@email.com");
}, 1000);
});</script>
"""
soup = Soup(html)
string = soup.find("script").text
string.split(".val(\"")[-1].split("\");")[0]
Which would output:
'name@email.com'
To add a bit more to @Bob's answer: suppose you also need to locate the right script tag in HTML that may contain other script tags.
The idea is to define a regular expression that is used both for locating the element with BeautifulSoup and for extracting the email value:
import re
from bs4 import BeautifulSoup
data = """
<body>
<script>jQuery(window).load(function () {
setTimeout(function(){
jQuery("input[name=Email]").val("name@email.com");
}, 1000);
});</script>
</body>
"""
pattern = re.compile(r'\.val\("([^@]+@[^@]+\.[^@]+)"\);', re.MULTILINE | re.DOTALL)
soup = BeautifulSoup(data, "html.parser")
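# find the first script tag whose contents match the pattern
# (string= is the newer name for the text= argument, Beautiful Soup 4.4.0+)
script = soup.find("script", string=pattern)
if script:
    match = pattern.search(script.string)
    if match:
        email = match.group(1)
        print(email)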
Prints: name@email.com.
Here we are using a simple regular expression for the email address. We could go further and make it stricter, but I doubt that would be practically necessary for this problem.
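As a rough illustration of what a stricter check could look like, the sketch below tightens the email part of the pattern. The character classes are an assumption chosen for this example and are nowhere near a full RFC 5322 validator. STRICT_PATTERN and extract_email are hypothetical names:

```python
import re

# Stricter than [^@]+@[^@]+\.[^@]+ : no whitespace or quotes are allowed, and
# the domain must be dot-separated and end in an alphabetic TLD of 2+ letters.
STRICT_PATTERN = re.compile(
    r'\.val\("([A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,})"\);'
)


def extract_email(script_text):
    """Return the email passed to .val("..."), or None if nothing matches."""
    match = STRICT_PATTERN.search(script_text)
    return match.group(1) if match else None


print(extract_email('jQuery("input[name=Email]").val("name@email.com");'))
print(extract_email('jQuery("input[name=Email]").val("not an email");'))
```

The first call prints name@email.com. The second prints None, because the argument contains spaces and no @ sign.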