regex not working in bs4

余生颓废 提交于 2019-12-10 22:02:19

问题


I am trying to extract some links from a specific filehoster on watchseriesfree.to website. In the following case I want rapidvideo links, so I use regex to filter out those tags with text containing rapidvideo

import re
import urllib2
from bs4 import BeautifulSoup

def gethtml(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    html = con.read()
    return html


def findLatest():
    url = "https://watchseriesfree.to/serie/Madam-Secretary"
    head = "https://watchseriesfree.to"

    soup = BeautifulSoup(gethtml(url), 'html.parser')
    latep = soup.find("a", title=re.compile('Latest Episode'))

    soup = BeautifulSoup(gethtml(head + latep['href']), 'html.parser')
    firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

    return firstVod

print(findLatest())

However, the above code returns a blank list. What am I doing wrong?


回答1:


The problem is here:

firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

When BeautifulSoup will apply your text regex pattern, it would use .string attribute values of all the matched tr elements. Now, the .string has this important caveat - when an element has multiple children, .string is None:

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None.

Hence, you have no results.

What you can do is to check the actual text of the tr elements by using a searching function and calling .get_text():

soup.find_all(lambda tag: tag.name == 'tr' and 'rapidvideo' in tag.get_text())


来源:https://stackoverflow.com/questions/43036243/regex-not-working-in-bs4

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!