Question
I am trying to extract the ranking text from this page (example: the profile of the Kaggle user ranked no. 1).
I am using the following code:
import requests
from bs4 import BeautifulSoup

def get_single_item_data(item_url):
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(plainText, "html.parser")
    for item_name in soup.findAll('h4', {'data-bind': "text: rankingText"}):
        print(item_name.string)

item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)
The result is None. The problem is that soup.findAll('h4',{'data-bind':"text: rankingText"}) outputs:
[<h4 data-bind="text: rankingText"></h4>]
but when inspecting the HTML of the page in the browser, the element looks like this:
<h4 data-bind="text: rankingText">1st</h4>
It's clear that the text is missing. How can I get around that?
Edit: Printing the soup variable in the terminal, I can see that this value exists, so there should be a way to access it through soup.
Edit 2: I tried, unsuccessfully, to use the most-voted answer from this Stack Overflow question. There could be a solution along those lines.
Answer 1:
If you aren't going to try browser automation through selenium as @Ali suggested, you would have to parse the JavaScript containing the desired information. You can do this in different ways. Here is working code that locates the script by a regular expression pattern, then extracts the profile object, loads it with json into a Python dictionary and prints out the desired ranking:
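import re
import json

from bs4 import BeautifulSoup
import requests

response = requests.get("https://www.kaggle.com/titericz")
soup = BeautifulSoup(response.content, "html.parser")

# Match the "profile: {...}," object embedded in one of the page's scripts
pattern = re.compile(r"profile: ({.*}),", re.MULTILINE | re.DOTALL)

# string= is the current name of the older text= argument (bs4 4.4.0+)
script = soup.find("script", string=pattern)
profile_text = pattern.search(script.string).group(1)

# The captured object is valid JSON, so it loads into a dictionary
profile = json.loads(profile_text)
print(profile["ranking"], profile["rankingText"])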
Prints:
1 1st
Answer 2:
The data is databound using javascript, as the "data-bind" attribute suggests.
However, if you download the page with e.g. wget, you'll see that the rankingText value is actually there inside this script element on initial load:
<script type="text/javascript">
profile: {
...
"ranking": 96,
"rankingText": "96th",
"highestRanking": 3,
"highestRankingText": "3rd",
...
So you could use that instead.
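As a rough illustration of reading the value straight from the embedded script, here is a minimal sketch that uses only the standard library. SAMPLE_HTML, RANKING_TEXT_RE and extract_ranking_text are hypothetical names made up for this example, and the sample merely mimics the fragment quoted above; the real page holds many more fields and may have changed since.

```python
import re

# Hypothetical sample that mimics the script fragment quoted above.
SAMPLE_HTML = '''
<h4 data-bind="text: rankingText"></h4>
<script type="text/javascript">
  profile: {
    "ranking": 96,
    "rankingText": "96th",
    "highestRanking": 3,
    "highestRankingText": "3rd"
  },
</script>
'''

# Capture the quoted value that follows the "rankingText" key.
RANKING_TEXT_RE = re.compile(r'"rankingText"\s*:\s*"([^"]*)"')


def extract_ranking_text(html):
    """Return the rankingText value found in the raw HTML, or None."""
    match = RANKING_TEXT_RE.search(html)
    return match.group(1) if match else None


print(extract_ranking_text(SAMPLE_HTML))
```

With a live page, the same function could be given the text returned by requests.get(...).text, assuming the page still embeds the data in this form.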
Answer 3:
I have solved your problem using regex on the plain text:
import re
import requests

def get_single_item_data(item_url):
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    # Capture the digits that follow the "ranking": key in the raw page text
    pattern = re.compile(r'ranking": ([0-9]+)')
    match = pattern.search(plainText)
    ranking = match.group(1)
    print(ranking)

item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)
This returns only the rank number, but I think it will help you, since from what I can see the rankingText just adds 'st', 'th', etc. to the right of the number.
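If the suffixed form is needed, it could be rebuilt from the number. The sketch below assumes the site follows the standard English ordinal rules, which this answer only guesses at; ordinal is a hypothetical helper, not something provided by the page or by any library used here.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2, 3.
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


for rank in (1, 2, 3, 11, 96):
    print(ordinal(rank))
```

The value captured by the regex is a string, so it would need int(ranking) before being passed to ordinal.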
Answer 4:
This could be because the data is filled in dynamically.
Some JavaScript code fills this tag after the page has loaded. Thus, if you fetch the HTML using requests, the tag is not filled yet:
<h4 data-bind="text: rankingText"></h4>
Please take a look at Selenium WebDriver. With this driver you can fetch the complete page, with the JavaScript running as it would in a normal browser.
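A minimal sketch of that approach, assuming Selenium 4 with Chrome and a matching driver available locally, could look like the following. fetch_ranking_text is a hypothetical helper name, and the CSS selector is taken from the h4 element shown in the question, so it only works while the page keeps that markup.

```python
def fetch_ranking_text(profile_url, timeout=10):
    """Load the page in a real browser and return the rendered ranking text."""
    # Imported here so this module still loads when selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are installed
    try:
        driver.get(profile_url)
        # Poll until the JavaScript has filled the <h4>. An empty string is
        # falsy, so the wait keeps going until some text appears or it times out.
        return WebDriverWait(driver, timeout).until(
            lambda d: d.find_element(
                By.CSS_SELECTOR, 'h4[data-bind="text: rankingText"]'
            ).text
        )
    finally:
        driver.quit()
```

Calling fetch_ranking_text('https://www.kaggle.com/titericz') would start a browser session, so it is slower than the requests-based approaches in the other answers and raises a timeout error if the text never appears.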
Source: https://stackoverflow.com/questions/34336040/pythongetting-text-from-html-using-beautifulsoup