Question
I am trying to extract the links for a specific file host from the watchseriesfree.to website. In the following case I want the rapidvideo links, so I use a regex to keep only those tags whose text contains rapidvideo:
    import re
    import urllib2
    from bs4 import BeautifulSoup

    def gethtml(link):
        req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
        con = urllib2.urlopen(req)
        html = con.read()
        return html

    def findLatest():
        url = "https://watchseriesfree.to/serie/Madam-Secretary"
        head = "https://watchseriesfree.to"
        soup = BeautifulSoup(gethtml(url), 'html.parser')
        latep = soup.find("a", title=re.compile('Latest Episode'))
        soup = BeautifulSoup(gethtml(head + latep['href']), 'html.parser')
        firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))
        return firstVod

    print(findLatest())
However, the code above returns an empty list. What am I doing wrong?
Answer 1:
The problem is here:
    firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))
When BeautifulSoup applies your text regex pattern, it tests it against the .string attribute of each matched tr element. The .string attribute has an important caveat: when an element has multiple children, .string is None. As the documentation puts it:
    If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None.
Hence, you have no results.
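To see this behaviour in isolation, here is a minimal sketch. The HTML is made up for illustration and is not taken from the real page; the row contents and the /link/1 path are assumptions. It uses string=, which is the current name of the text= argument in Beautiful Soup 4.4.0 and later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the first row has two cells, the second only one.
html = (
    "<table>"
    "<tr><td>rapidvideo</td><td><a href='/link/1'>Watch</a></td></tr>"
    "<tr><td>rapidvideo</td></tr>"
    "</table>"
)

soup = BeautifulSoup(html, "html.parser")
multi_child_row, single_child_row = soup.find_all("tr")

# More than one child: .string is None, so the regex has nothing to test.
print(multi_child_row.string)

# Exactly one child with one string: .string falls through to that string.
print(single_child_row.string)

# Only the single-child row can match a string/text filter on "tr".
matches = soup.find_all("tr", string=re.compile("rapidvideo"))
print(len(matches))
```

Rows on a real listing page usually hold several cells, so they would behave like the first row here.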
Instead, check the full text of each tr element by passing a filter function to find_all() and calling .get_text() inside it:
    soup.find_all(lambda tag: tag.name == 'tr' and 'rapidvideo' in tag.get_text())
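As a self-contained sketch of that approach, the snippet below runs the same filter against made-up markup and then pulls the href out of each matching row. The table layout, the host names and the /link/... paths are assumptions for illustration, and rows_mentioning is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the episode page's link table.
html = """
<table>
  <tr><td>rapidvideo.com</td><td><a href="/link/1">Watch</a></td></tr>
  <tr><td>openload.co</td><td><a href="/link/2">Watch</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def rows_mentioning(soup, needle):
    """Return every <tr> whose combined text contains `needle`."""
    return soup.find_all(
        lambda tag: tag.name == "tr" and needle in tag.get_text()
    )


rows = rows_mentioning(soup, "rapidvideo")

# Collect the link targets from the rows that matched.
links = [a["href"] for row in rows for a in row.find_all("a", href=True)]
print(links)
```

Unlike .string, get_text() concatenates the text of all descendants, so the number of cells in a row no longer matters.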
Source: https://stackoverflow.com/questions/43036243/regex-not-working-in-bs4