Python - Scrape movie titles with Splash & BS4

别等时光非礼了梦想. Posted on 2020-12-15 06:14:51

Question


I'm trying to write my first Python script, using Splash and BS4.

I followed this tutorial by John Watson Rooney (with my own target site): How I Scrape JAVASCRIPT websites with Python

My goal is to scrape this survey page: Best movies of 2020

Here's my problem: the script outputs the same titles several times, with up to 6 duplicates of each, in no logical order. Sometimes it prints fewer than 100 lines, sometimes more.

What I want:

  • Get the 100 titles, in order
  • Export them to a .csv file.

Here is my code:

import requests
import csv
from bs4 import BeautifulSoup

url = 'https://www.senscritique.com/top/resultats/Les_meilleurs_films_de_2020/2582670'

r = requests.get('http://localhost:8050/render.html',
                 params={'url': url, 'wait': 2})

soup = BeautifulSoup(r.text, 'html.parser')

podium = soup.find_all('li', class_="elpo-item")
podium_list = []

for titres in podium:
    for titles in soup.find_all('h2'):
        podium_list.append(titles.text)

for liste in podium_list:
    print(liste)

Questions:

  • How can I scrape only the 100 titles? What did I miss?
  • Is my code correct, and how can I optimize it?
  • Is Splash really a good fit for my use case, or is there an easier library for scraping JS websites?

For the .csv part, I'm going to try it myself right now, but if you have any tips, I'm all ears!

Thank you for your help.


Answer 1:


The page content is loaded dynamically, at least the second batch of 50 movies, so you have to request a different endpoint.

Here's how:

import requests
from bs4 import BeautifulSoup


headers = {
    "Referer": "https://www.senscritique.com/top/resultats/Les_meilleurs_films_de_2020/2582670",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

titles = []
# Request pages 1 and 2 of the AJAX endpoint.
for page in range(1, 3):
    url = f"https://www.senscritique.com/sc2/top/resultats/2582670/page-{page}.ajax?limit=1000"
    response = requests.get(url, headers=headers)
    items = BeautifulSoup(response.text, "html.parser").find_all("li", {"class": "elpo-item"})
    # Take the <h2> inside each item and drop the "(2020)" year suffix.
    titles.extend(i.find("h2").getText(strip=True).replace("(2020)", "").strip() for i in items)

for title in titles:
    print(title)

Output:

1917
Jojo Rabbit
The Gentlemen
Uncut Gems
Tenet
Le Cas Richard Jewell
Dark Waters
En avant
Drunk
Les Filles du docteur March
Adieu les cons
Invisible Man
Play
Les Enfants du temps
L'Adieu
La Plateforme
Séjour dans les monts Fuchun
The King of Staten Island
Été 85
Birds of Prey (et la Fantabuleuse Histoire de Harley Quinn)
La Communion
and so on...
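
As for the duplicates in the question's script, they most likely come from the nested loop: the inner soup.find_all('h2') searches the whole page again for every li item, so each title is appended once per item. The sketch below reproduces this on a small, made-up HTML fragment (SAMPLE_HTML and both function names are hypothetical, not taken from the site) and compares it with a lookup scoped to each item, which is what the code above does.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the rendered page (assumption, not real site output).
SAMPLE_HTML = """
<ul>
  <li class="elpo-item"><h2>1917 (2020)</h2></li>
  <li class="elpo-item"><h2>Jojo Rabbit (2020)</h2></li>
  <li class="elpo-item"><h2>The Gentlemen (2020)</h2></li>
</ul>
"""


def titles_nested(html):
    """Mimic the question's loops: every <h2> on the page is added once per item."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for _item in soup.find_all("li", class_="elpo-item"):
        # Bug: this searches the whole document, not the current item.
        for heading in soup.find_all("h2"):
            found.append(heading.get_text(strip=True))
    return found


def titles_scoped(html):
    """Look for the <h2> inside each item only, so each title appears once."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for item in soup.find_all("li", class_="elpo-item"):
        heading = item.find("h2")
        if heading is not None:
            found.append(heading.get_text(strip=True))
    return found


nested = titles_nested(SAMPLE_HTML)
scoped = titles_scoped(SAMPLE_HTML)
print(len(nested), len(scoped))  # 9 3
```

With 3 items the nested version yields 3 x 3 = 9 entries, while the scoped version yields the 3 titles in document order.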

The movies are in order, by the way.
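
Since the list is already in rank order, the .csv export from the question can be done with the standard csv module. This is a minimal sketch, not the only way to do it: write_titles_csv, the "rank" and "title" column names, and the sample list are all assumptions made for illustration. It writes to an in-memory buffer so it runs without touching the disk.

```python
import csv
import io


def write_titles_csv(titles, out_file):
    """Write one row per title, numbered from 1, to an open text file object."""
    writer = csv.writer(out_file)
    writer.writerow(["rank", "title"])  # hypothetical column names
    for rank, title in enumerate(titles, start=1):
        # strip() removes any whitespace left over after dropping "(2020)".
        writer.writerow([rank, title.strip()])


# Stand-in for the scraped `titles` list (sample values only).
sample_titles = [
    "1917",
    "Jojo Rabbit ",
    "Birds of Prey (et la Fantabuleuse Histoire de Harley Quinn)",
]

buffer = io.StringIO(newline="")
write_titles_csv(sample_titles, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, pass a file opened with open("titles.csv", "w", newline="", encoding="utf-8") (the file name is just an example). The newline="" argument lets the csv module control line endings, and the explicit UTF-8 encoding keeps accented titles such as "Été 85" intact.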



Source: https://stackoverflow.com/questions/64836790/python-scrape-movies-titles-with-splash-bs4
