Retrieving a subset of href's from findall() in BeautifulSoup

问题

My goal is to write a python script that takes an artist's name as a string input and then appends it to the base URL that goes to the genius search query.Then retrieves all the lyrics from the returned web page's links (Which is the required subset of this problem that will also contain specifically the artist name in every link in that subset.).I am in the initial phase right now and just have been able to retrieve all links from the web page including the ones that I don't want in my subset. I tried to find a simple solution but failed continuously.

import requests
# The Requests library.

from bs4 import BeautifulSoup
from lxml import html

user_input = input("Enter Artist Name = ").replace(" ","+")
base_url = "https://genius.com/search?q="+user_input

header = {'User-Agent':''}
response = requests.get(base_url, headers=header)

soup = BeautifulSoup(response.content, "lxml")

for link in soup.find_all('a',href=True):
        print (link['href'])

This returns this complete list while I only need the ones that end with lyrics and the artist's name (here for instance Drake). These will the links from where I should be able to retrieve the lyrics.

https://genius.com/
/signup
/login
https://www.facebook.com/geniusdotcom/
https://twitter.com/Genius
https://www.instagram.com/genius/
https://www.youtube.com/user/RapGeniusVideo
https://genius.com/new
https://genius.com/Drake-hotline-bling-lyrics
https://genius.com/Drake-one-dance-lyrics
https://genius.com/Drake-hold-on-were-going-home-lyrics
https://genius.com/Drake-know-yourself-lyrics
https://genius.com/Drake-back-to-back-lyrics
https://genius.com/Drake-all-me-lyrics
https://genius.com/Drake-0-to-100-the-catch-up-lyrics
https://genius.com/Drake-started-from-the-bottom-lyrics
https://genius.com/Drake-from-time-lyrics
https://genius.com/Drake-the-motto-lyrics
/search?page=2&q=drake
/search?page=3&q=drake
/search?page=4&q=drake
/search?page=5&q=drake
/search?page=6&q=drake
/search?page=7&q=drake
/search?page=8&q=drake
/search?page=9&q=drake
/search?page=672&q=drake
/search?page=673&q=drake
/search?page=2&q=drake
/embed_guide
/verified-artists
/contributor_guidelines
/about
/static/press
mailto:brands@genius.com
https://eventspace.genius.com/
/static/privacy_policy
/jobs
/developers
/static/terms
/static/copyright
/feedback/new
https://genius.com/Genius-how-genius-works-annotated
https://genius.com/Genius-how-genius-works-annotated

My next step would be to use selenium to emulate scroll which in the case of genius.com gives the entire set of search results. Any suggestions or resources would be appreciated. I would also like a few comments about the way I wish to proceed with this solution. Can we make it more generic?

P.S. I may not have well lucidly explained my problem but I have tried my best. Also, any ambiguities are welcome too. I am new to scraping and python and programming as well in so, just wanted to make sure that I am following the right path.

回答1:

Use the regex module to match only the links you want.

import requests
# The Requests library.

from bs4 import BeautifulSoup
from lxml import html
from re import compile

user_input = input("Enter Artist Name = ").replace(" ","+")
base_url = "https://genius.com/search?q="+user_input

header = {'User-Agent':''}
response = requests.get(base_url, headers=header)

soup = BeautifulSoup(response.content, "lxml")

pattern = re.compile("[\S]+-lyrics$")

for link in soup.find_all('a',href=True):
    if pattern.match(link['href']):
        print (link['href'])

Output:

https://genius.com/Drake-hotline-bling-lyrics
https://genius.com/Drake-one-dance-lyrics
https://genius.com/Drake-hold-on-were-going-home-lyrics
https://genius.com/Drake-know-yourself-lyrics
https://genius.com/Drake-back-to-back-lyrics
https://genius.com/Drake-all-me-lyrics
https://genius.com/Drake-0-to-100-the-catch-up-lyrics
https://genius.com/Drake-started-from-the-bottom-lyrics
https://genius.com/Drake-from-time-lyrics
https://genius.com/Drake-the-motto-lyrics

This just looks if your link matches the pattern ending in -lyrics. You may use similar logic to filter using user_input variable as well.

Hope this helps.

来源：https://stackoverflow.com/questions/43261897/retrieving-a-subset-of-hrefs-from-findall-in-beautifulsoup

标签

python

python-2.7

web-scraping

beautifulsoup

lxml