Question
I'm trying to parse the HTML of a webpage that requires being logged in. I can get the HTML of a webpage using this script:
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
import re
webpage = urlopen('https://www.example.com')
soup = BeautifulSoup(webpage)
print soup
# This would print the source of example.com
But getting the source of a webpage that I'm logged into is proving more difficult. I tried replacing ('https://www.example.com') with ('https://user:pass@example.com'), but I got an Invalid URL error.
Anyone know how I could do this? Thanks in advance.
Answer 1:
Selenium WebDriver ( http://seleniumhq.org/projects/webdriver/ ) might be good for your needs here. You can log in to the page and then print the contents of the HTML. Here's an example:
from selenium import webdriver
# initiate
driver = webdriver.Firefox() # initiate a driver, in this case Firefox
driver.get("http://example.com") # go to the url
# locate the login form
username_field = driver.find_element_by_name(...) # get the username field
password_field = driver.find_element_by_name(...) # get the password field
# log in
username_field.send_keys("username") # enter in your username
password_field.send_keys("password") # enter in your password
password_field.submit() # submit it
# print HTML
html = driver.page_source
print html
Answer 2:
I suggest using mechanize.
See the related question "Python mechanize login to website".
In mechanize you set up a browser object, so cookies etc. are taken care of for you.
You can iterate through the forms and links, e.g.:
# browser is a mechanize.Browser() that has already opened the login page
for form in browser.forms():
    print form
You can then select the form you want and fill it in however you like.
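To see what iterating over a page's forms gives you without installing mechanize, here is a rough sketch that swaps in the standard library's html.parser instead. It is written for Python 3 (the question's script targets Python 2), and FormLister and list_forms are hypothetical names made up for this illustration, not part of mechanize.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect each form's action and the names of its input fields."""

    def __init__(self):
        super().__init__()
        self.forms = []  # one dict per <form> tag encountered

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append({"action": attrs.get("action"), "fields": []})
        elif tag == "input" and self.forms and attrs.get("name"):
            # Attribute the field to the most recently opened form.
            self.forms[-1]["fields"].append(attrs["name"])


def list_forms(html):
    """Return a list of {'action': ..., 'fields': [...]} dicts for html."""
    parser = FormLister()
    parser.feed(html)
    parser.close()
    return parser.forms
```

The field names it reports are the values of the name attributes, which is what a login POST body or a find_element_by_name call would need.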
Answer 3:
You can try sending a POST request to the login form (with the login credentials), then save the received cookie and supply it when downloading the page that requires you to be logged in.
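A minimal sketch of that idea, assuming the site uses an ordinary form-encoded login and a session cookie. It is written for Python 3's standard library (the question's script targets Python 2), and fetch_logged_in_page is a hypothetical helper name. The form field names must match whatever the real login form uses.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener


def fetch_logged_in_page(login_url, page_url, form_fields):
    """Log in with a POST, then fetch page_url reusing the session cookie."""
    # The opener stores cookies from every response in the jar and sends
    # them back on later requests to the same site.
    opener = build_opener(HTTPCookieProcessor(CookieJar()))

    # Passing data= makes urllib send a POST with a form-encoded body.
    body = urlencode(form_fields).encode("utf-8")
    opener.open(login_url, data=body).close()

    # The second request carries any cookie set by the login response.
    with opener.open(page_url) as response:
        return response.read().decode("utf-8", errors="replace")
```

A call might look like fetch_logged_in_page('https://www.example.com/login', 'https://www.example.com/private', {'username': 'user', 'password': 'pass'}), where the /login and /private paths and both field names are placeholders. The returned HTML can then be passed to BeautifulSoup as in the question. Sites that add hidden CSRF tokens or log in via JavaScript will need more than this.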
Answer 4:
We can do it using the selenium module, as below:
from selenium import webdriver
# initiate
my_browser = webdriver.Firefox()
my_browser.get("fill with url of the login page")
try:
    my_browser.implicitly_wait(35)  # wait up to 35 seconds for elements to appear
    # pass the value of each field's name attribute, as found in the page source
    username_field = my_browser.find_element_by_name('enter the value of the name attribute')
    password_field = my_browser.find_element_by_name('enter the value of the name attribute')
    username_field.send_keys("fill with user name")
    password_field.send_keys("fill with password")
    password_field.submit()  # submit it
finally:
    print 'Look Into the Browser'
Source: https://stackoverflow.com/questions/9387500/python-how-do-i-parse-html-of-a-webpage-that-requires-being-logged-in