regex not working in bs4

Question

I am trying to extract the links for a specific file hoster from the watchseriesfree.to website. In this case I want the rapidvideo links, so I use a regex to keep only the tags whose text contains rapidvideo:

import re
import urllib2
from bs4 import BeautifulSoup

def gethtml(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    html = con.read()
    return html


def findLatest():
    url = "https://watchseriesfree.to/serie/Madam-Secretary"
    head = "https://watchseriesfree.to"

    soup = BeautifulSoup(gethtml(url), 'html.parser')
    latep = soup.find("a", title=re.compile('Latest Episode'))

    soup = BeautifulSoup(gethtml(head + latep['href']), 'html.parser')
    firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

    return firstVod

print(findLatest())

However, the code above returns an empty list. What am I doing wrong?

Answer 1:

The problem is here:

firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

When BeautifulSoup applies your text regex, it matches the pattern against the .string attribute of each tr element it finds. But .string has an important caveat: when an element has more than one child, .string is None:

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None.

Hence, you have no results.
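The effect can be reproduced offline on a small, made-up table. This is only a sketch: SAMPLE_HTML is invented for illustration and is not the real page markup, and it uses the string= keyword, which is the newer name for text= in Beautiful Soup 4.4 and later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example (not the real site).
SAMPLE_HTML = (
    "<table>"
    "<tr><td>rapidvideo.com</td><td>Watch</td></tr>"  # two children
    "<tr><td>rapidvideo.com</td></tr>"                # a single child
    "</table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.find_all("tr")

# A row with several cells has no single string, so .string is None.
print(rows[0].string)
# A row with exactly one cell inherits that cell's string.
print(rows[1].string)

# The regex is only tested against .string, so the first row is skipped.
matched = soup.find_all("tr", string=re.compile("rapidvideo"))
print(len(matched))
```

Real result tables usually have several cells per row, which is why every row behaves like the first one here and nothing matches.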

Instead, check the full text of each tr element by passing a function to find_all() and calling .get_text() inside it:

soup.find_all(lambda tag: tag.name == 'tr' and 'rapidvideo' in tag.get_text())
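If a regular expression is still wanted rather than a plain substring test, the same idea works with pattern.search() on the row text. The sketch below is an assumption-laden illustration: SAMPLE_HTML, the example.com links, and the helper name rows_matching are all hypothetical and not part of the original code or the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example (not the real site).
SAMPLE_HTML = (
    "<table>"
    "<tr><td>rapidvideo.com</td><td><a href='/open/1'>Watch</a></td></tr>"
    "<tr><td>openload.co</td><td><a href='/open/2'>Watch</a></td></tr>"
    "</table>"
)


def rows_matching(soup, pattern):
    """Return every <tr> whose combined text matches the compiled pattern."""
    return soup.find_all(
        lambda tag: tag.name == "tr" and pattern.search(tag.get_text()) is not None
    )


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
found = rows_matching(soup, re.compile(r"rapidvideo", re.IGNORECASE))

# Collect the link of each matching row.
links = [row.find("a")["href"] for row in found]
print(links)
```

Because get_text() joins the text of all descendants, the match no longer depends on how many cells a row contains.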

Source: https://stackoverflow.com/questions/43036243/regex-not-working-in-bs4

Tags

python

regex

urllib2

bs4