Question
I have a very simple problem. I'm trying to get the job description from the HTML of a LinkedIn page, but instead of the page's HTML I get a few lines that look like JavaScript code. I'm very new to this, so any help will be greatly appreciated! Thanks
Here's my code:
import requests
url = "https://www.linkedin.com/jobs/view/inside-sales-manager-at-stericycle-1089095836/"
page_html = requests.get(url).text
print(page_html)
When I run this, I don't get the HTML I expect, containing the job description. I just get a few lines of JavaScript code instead.
Answer 1:
Some websites serve different content depending on the client that requests it. LinkedIn is a good example of this behavior. When the client is a full browser that can run JavaScript, the site presents richer content that is rendered dynamically. A plain HTTP client such as requests does not execute JavaScript, so it only receives the initial script and never sees the rendered page.
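To confirm that this is what happened to a given response, one rough diagnostic is to measure how much of the returned text sits inside <script> tags. The sketch below uses only the standard library; the function name looks_like_js_shell and the 0.8 threshold are assumptions made for illustration, not part of requests or LinkedIn.

```python
from html.parser import HTMLParser


class _TextSplitter(HTMLParser):
    """Counts text inside <script> tags separately from all other text."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.script_chars = 0
        self.other_chars = 0  # everything outside <script>, including <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        n = len(data.strip())
        if self._in_script:
            self.script_chars += n
        else:
            self.other_chars += n


def looks_like_js_shell(html, threshold=0.8):
    """Return True when at least `threshold` of the text is inside <script> tags."""
    parser = _TextSplitter()
    parser.feed(html)
    parser.close()
    total = parser.script_chars + parser.other_chars
    if total == 0:
        return False  # no text at all, nothing to judge
    return parser.script_chars / total >= threshold
```

Calling looks_like_js_shell(page_html) on the string returned by requests.get(url).text gives a quick yes/no hint. It is only a heuristic: a True result suggests the page needs a JavaScript-capable client, but it does not prove it.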
To solve this problem, you need to follow these steps:
- Download ChromeDriver. Choose the one that matches your OS and your installed Chrome version.
- Extract the driver and put it in a directory of your choice, for example /usr.
- Install Selenium, which is a Python module, by running pip install selenium. Note that the selenium install may require another package called msgpack; if pip complains about it, install it first with pip install msgpack.
Now we are ready to run the following code:
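Before running it, it can help to check that the setup steps actually took effect. The following is a minimal sketch using only the standard library; check_setup is a hypothetical helper name, and /usr/chromedriver is simply the example path used in this answer.

```python
import importlib.util
import os


def check_setup(driver_path):
    """Report whether selenium is importable and the driver file is usable."""
    return {
        # find_spec returns None when the package is not installed
        "selenium_installed": importlib.util.find_spec("selenium") is not None,
        # the driver must exist as a regular file and be executable by this user
        "driver_found": os.path.isfile(driver_path) and os.access(driver_path, os.X_OK),
    }


if __name__ == "__main__":
    # Replace the path with wherever you extracted the driver
    print(check_setup("/usr/chromedriver"))
```

If both values are True, the Selenium code from the answer should be able to find the driver and start the browser: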
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def create_browser(webdriver_path):
    # Create a Selenium-driven Chrome instance that behaves like a real browser
    browser_options = Options()
    # The --headless flag runs Chrome without a visible window
    browser_options.add_argument("--headless")
    browser_options.add_argument("--no-sandbox")
    # Selenium 4 syntax; on Selenium 3 use
    # webdriver.Chrome(webdriver_path, chrome_options=browser_options) instead
    browser = webdriver.Chrome(service=Service(webdriver_path), options=browser_options)
    print("Done Creating Browser")
    return browser

url = "https://www.linkedin.com/jobs/view/inside-sales-manager-at-stericycle-1089095836/"
browser = create_browser('/usr/chromedriver')  # DON'T FORGET TO CHANGE THIS TO YOUR DIRECTORY
browser.get(url)
page_html = browser.page_source
browser.quit()  # shut down Chrome and the driver process once the HTML is captured
print(page_html[-10:])  # prints dy></html>
Now, you have the whole page. I hope this answers your question!!
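The question was about the job description specifically, so the text still has to be pulled out of page_html. The sketch below does this with the standard library's html.parser so that it runs without extra installs; BeautifulSoup would do the same in fewer lines. The helper extract_text_by_class and the class name "description" are assumptions for illustration only. Inspect the real page in your browser's developer tools to find the element that actually holds the description.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collects the text found inside every element carrying a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # greater than 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1  # nested element inside the match
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text_by_class(html, class_name):
    """Return the text inside matching elements, joined by single spaces."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

With the page source from above, a call such as extract_text_by_class(page_html, "description") returns the matching text, or an empty string if no element has that class. The depth counting assumes reasonably well-formed HTML in which non-void tags are closed.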
Source: https://stackoverflow.com/questions/54396285/python-requests-geturl-returning-javascript-code-instead-of-the-page-html