Python - Scrape movie titles with Splash & BS4

Question

I'm trying to write my first Python script. I'm using Splash and BS4.

I followed this tutorial from John Watson Rooney (but with my own target): How I Scrape JAVASCRIPT websites with Python

My goal is to scrape this website survey: Best movies of 2020

Here's my problem: the same titles come back several times, with up to 6 duplicates each, in no logical order. Sometimes the script returns fewer than 100 lines, sometimes more.

What I want:

Get the 100 titles, in order
Export them to a .csv file.

Here is my code:

import requests
import csv
from bs4 import BeautifulSoup

url = 'https://www.senscritique.com/top/resultats/Les_meilleurs_films_de_2020/2582670'

r = requests.get('http://localhost:8050/render.html',
                 params={'url': url, 'wait': 2})

soup = BeautifulSoup(r.text, 'html.parser')

podium = soup.find_all('li', class_="elpo-item")
podium_list = []

for titres in podium:
    for titles in soup.find_all('h2'):
        podium_list.append(titles.text)

for liste in podium_list:
    print(liste)

Questions:

How can I scrape only the 100 titles? What did I miss?
Is my code right, and how can I optimize it?
Is Splash really a good fit for my use case, or is there another, easier library for scraping JS websites?

For the .csv part, I'm going to try by myself right now, but if you have any tips, I'm all ears!

Thank you for your help.

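Answer 1:

The page is loaded dynamically (at least the second batch of 50 movies), so you have to make a request to a different endpoint for each page. The duplicates in your code come from the nested loop: soup.find_all('h2') searches the whole document on every pass of the outer loop, so each title is appended once per li item.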

Here's how:

import requests
from bs4 import BeautifulSoup

# Headers that make the request look like the page's own AJAX call.
headers = {
    "Referer": "https://www.senscritique.com/top/resultats/Les_meilleurs_films_de_2020/2582670",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

titles = []
# The ranking is split across two pages (page-1 and page-2).
for page in range(1, 3):
    url = f"https://www.senscritique.com/sc2/top/resultats/2582670/page-{page}.ajax?limit=1000"
    response = requests.get(url, headers=headers)
    items = BeautifulSoup(response.text, "html.parser").find_all("li", class_="elpo-item")
    # Look for the <h2> inside each item, so every title is collected exactly once.
    titles.extend(
        item.find("h2").get_text(strip=True).replace("(2020)", "").strip()
        for item in items
    )

for title in titles:
    print(title)

Output:

1917
Jojo Rabbit
The Gentlemen
Uncut Gems
Tenet
Le Cas Richard Jewell
Dark Waters
En avant
Drunk
Les Filles du docteur March
Adieu les cons
Invisible Man
Play
Les Enfants du temps
L'Adieu
La Plateforme
Séjour dans les monts Fuchun
The King of Staten Island
Été 85
Birds of Prey (et la Fantabuleuse Histoire de Harley Quinn)
La Communion
and so on...

The movies are in order, by the way.
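
For the .csv part, here is a minimal sketch using only the standard library's csv module. The function name write_titles_csv, the "rank" and "title" column names, and the output file name are my own choices for illustration, not anything the site or the question prescribes. It assumes titles is a list of strings that is already in ranking order.

```python
import csv


def write_titles_csv(titles, path):
    """Write titles to a CSV file with a 1-based rank column.

    Hypothetical helper: the column names "rank" and "title" are assumptions.
    """
    # newline="" lets the csv module control line endings itself;
    # utf-8 keeps accented titles such as "Été 85" intact.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["rank", "title"])
        for rank, title in enumerate(titles, start=1):
            writer.writerow([rank, title])
```

Calling write_titles_csv(titles, "best_movies_2020.csv") after the scraping loop would then produce one row per movie. The csv writer quotes any title that contains a comma, so you don't need to escape anything by hand.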

Source: https://stackoverflow.com/questions/64836790/python-scrape-movies-titles-with-splash-bs4

Tags

python

web-scraping

beautifulsoup