Scraping text from Kickstarter projects returns nothing

Submitted by [亡魂溺海] on 2020-06-28 03:55:27

Question


I am trying to scrape the main text of a project from its Kickstarter project page. The following code works for the first URL but not for the second or third. Is there an easy fix to my code that does not require additional packages?

url = "https://www.kickstarter.com/projects/1365297844/kuhkubus-3d-escher-figures?ref=discovery_staff_picks_category_newest"
#url = "https://www.kickstarter.com/projects/clarissaredwine/swingby-a-voyager-gravity-puzzle?ref=discovery_staff_picks_category_newest"
#url = "https://www.kickstarter.com/projects/100389301/us-army-navy-marines-air-force-special-challenge-c?ref=category"

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
body_text = soup.find(class_='rte__content')
all_text = body_text.find_all('p')
for i in all_text:
    print(i.get_text())

Answer 1:


This site uses a GraphQL API at:

POST https://www.kickstarter.com/graph

We can use it to fetch the data for any URL (any project) instead of scraping the HTML. There are two fields, story and risks, that we will extract.

This GraphQL API needs a CSRF token, which is embedded in a meta tag on the page (any page will do). We also need to store the cookies using a requests session, otherwise the call will fail.

Here is a simple example of calling the API with Python:

import requests
from bs4 import BeautifulSoup

s = requests.Session()
r = s.get("https://www.kickstarter.com")  # any page will do; we only need its cookies and CSRF token
soup = BeautifulSoup(r.text, 'html.parser')
xcsrf = soup.find("meta", {"name": "csrf-token"})["content"]  # token embedded in a <meta> tag

query = """
query GetEndedToLive($slug: String!) {
  project(slug: $slug) {
      id
      deadlineAt
      showCtaToLiveProjects
      state
      description
      url
      __typename
  }
}"""

r = s.post("https://www.kickstarter.com/graph",
    headers= {
        "x-csrf-token": xcsrf
    },
    json = {
        "query": query,
        "variables": {
            "slug":"kuhkubus-3d-escher-figures"
        }
    })

print(r.json())
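
Assuming the call succeeds, the response follows the usual GraphQL envelope, so the requested fields sit under data.project. A minimal sketch of reading them (field names taken from the query above):

result = r.json()
project = result["data"]["project"]  # standard GraphQL response envelope
print(project["state"])              # current project state requested in the query
print(project["description"])        # short project description
print(project["url"])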

Looking at your second link, the Campaign query it uses exposes some interesting fields. The complete query is the following:

query Campaign($slug: String!) {
  project(slug: $slug) {
    id
    isSharingProjectBudget
    risks
    story(assetWidth: 680)
    currency
    spreadsheet {
      displayMode
      public
      url
      data {
        name
        value
        phase
        rowNum
        __typename
      }
      dataLastUpdatedAt
      __typename
    }
    environmentalCommitments {
      id
      commitmentCategory
      description
      __typename
    }
    __typename
  }
}

We are only interested in story and risks, so we can reduce the query to:

query Campaign($slug: String!) {
  project(slug: $slug) {
    risks
    story(assetWidth: 680)
  }
}

Note that we need the project slug, which is part of the URL; for instance, clarissaredwine/swingby-a-voyager-gravity-puzzle is the slug for your second URL.
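
If you prefer not to hand-roll a regular expression, the slug can also be pulled out of the URL path with the standard library. A minimal sketch (the full implementation below uses a regex instead):

from urllib.parse import urlparse

url = "https://www.kickstarter.com/projects/clarissaredwine/swingby-a-voyager-gravity-puzzle?ref=discovery_staff_picks_category_newest"
path = urlparse(url).path              # "/projects/clarissaredwine/swingby-a-voyager-gravity-puzzle"
slug = path.split("/projects/", 1)[1]  # "clarissaredwine/swingby-a-voyager-gravity-puzzle"
print(slug)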

Here is a sample implementation that extracts the slugs, loops through them, and calls the GraphQL endpoint for each one, printing the story and the risks for each project:

import requests
from bs4 import BeautifulSoup
import re

urls = [ 
    "https://www.kickstarter.com/projects/1365297844/kuhkubus-3d-escher-figures?ref=discovery_staff_picks_category_newest",
    "https://www.kickstarter.com/projects/clarissaredwine/swingby-a-voyager-gravity-puzzle?ref=discovery_staff_picks_category_newest",
    "https://www.kickstarter.com/projects/100389301/us-army-navy-marines-air-force-special-challenge-c?ref=category"
]
slugs = []

# extract the slug from each URL
for url in urls:
    slugs.append(re.search(r'/projects/(.*)\?', url).group(1))

s = requests.Session()
r = s.get("https://www.kickstarter.com")
soup = BeautifulSoup(r.text, 'html.parser')
xcsrf = soup.find("meta", {"name": "csrf-token"})["content"]

query = """
query Campaign($slug: String!) {
  project(slug: $slug) {
    risks
    story(assetWidth: 680)
  }
}"""

for slug in slugs:
    print(f"--------{slug}------")
    r = s.post("https://www.kickstarter.com/graph",
        headers= {
            "x-csrf-token": xcsrf
        },
        json = {
            "operationName":"Campaign",
            "variables":{
                "slug": slug
            },
            "query": query
        })

    result = r.json()

    print("-------STORY--------")
    story_html = result["data"]["project"]["story"]
    soup = BeautifulSoup(story_html, 'html.parser')
    for i in soup.find_all('p'):
        print(i.get_text())

    print("-------RISKS--------")
    print(result["data"]["project"]["risks"])

I guess you can use the GraphQL endpoint for many other things if you are scraping other content on this site. However, note that introspection has been disabled on this API, so you can only reuse queries that the site itself already makes (you can't retrieve the whole schema).



Source: https://stackoverflow.com/questions/62335537/scraping-text-from-kickstarter-projects-return-nothing
