Python: How do I parse HTML of a webpage that requires being logged in?

Submitted by 岁酱吖の on 2019-12-12 18:15:54

Question


I'm trying to parse the HTML of a webpage that requires being logged in. I can get the HTML of a webpage using this script:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page and parse it (no login involved here)
webpage = urlopen('https://www.example.com')
soup = BeautifulSoup(webpage, 'html.parser')
print(soup)
# This prints the source of example.com

But trying to get the source of a webpage that I'm logged into proves to be more difficult. I tried replacing the ('https://www.example.com') with ('https://user:pass@example.com') but I got an Invalid URL error.

Anyone know how I could do this? Thanks in advance.


Answer 1:


Selenium WebDriver ( http://seleniumhq.org/projects/webdriver/ ) might be good for your needs here. You can log in to the page and then print the contents of the HTML. Here's an example:

from selenium import webdriver
from selenium.webdriver.common.by import By

# initiate
driver = webdriver.Firefox()  # start a browser session, in this case Firefox
driver.get("http://example.com")  # go to the URL

# locate the login form
username_field = driver.find_element(By.NAME, ...)  # get the username field by its name attribute
password_field = driver.find_element(By.NAME, ...)  # get the password field by its name attribute

# log in
username_field.send_keys("username")  # enter your username
password_field.send_keys("password")  # enter your password
password_field.submit()  # submit the form that contains this field

# print HTML
html = driver.page_source
print(html)
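
Once driver.page_source has given you the HTML as a string, you can parse it like any other HTML. As a rough sketch, the snippet below pulls the link targets out of such a string. It uses the standard library's html.parser, swapped in here in place of BeautifulSoup so that it runs without extra packages. The LinkCollector class, the extract_links function and the sample HTML are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Stand-in for the string returned by driver.page_source
sample_html = (
    '<html><body><a href="/account">Account</a> '
    '<a href="/logout">Log out</a></body></html>'
)
print(extract_links(sample_html))
```

In a real session you would pass driver.page_source instead of sample_html.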



Answer 2:


I suggest using mechanize.

See the related question "Python mechanize login to website".

In mechanize you set up a browser object, which takes care of cookies and similar session state for you.

You can iterate through the forms and links, e.g.:

for form in browser.forms():
    print(form)

You can then select the form you want and fill it in as needed.
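
To make the steps above concrete, here is a minimal sketch of a mechanize login. It assumes the login form is the first form on the page and that its fields are named "username" and "password". Those field names, the login_and_fetch function and the URLs in the usage comment are all assumptions, so check the real names by printing the forms first.

```python
def login_and_fetch(login_url, protected_url, username, password):
    """Log in through the first form on login_url, then fetch protected_url.

    Assumptions (not from the answer): the login form is the first form on
    the page, and its fields are named "username" and "password".
    Returns the raw response body of protected_url as bytes.
    """
    # Third-party package (pip install mechanize); imported here so that this
    # module still loads on machines where mechanize is not installed.
    import mechanize

    browser = mechanize.Browser()
    browser.open(login_url)

    browser.select_form(nr=0)        # pick the first form on the page
    browser["username"] = username   # assumed field name
    browser["password"] = password   # assumed field name
    browser.submit()                 # cookies from the response stay in the browser

    # The same browser object sends the stored cookies with this request.
    return browser.open(protected_url).read()


# Hypothetical usage:
# html = login_and_fetch("https://www.example.com/login",
#                        "https://www.example.com/private",
#                        "user", "pass")
```

The returned bytes can be handed straight to BeautifulSoup for parsing.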




Answer 3:


You can try sending a POST request to the login form (with the login credentials), then save the received cookie and supply it when downloading the page that requires you to be logged in.
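
A minimal sketch of that approach using only the standard library might look like the following. The function names (make_session, login, fetch), the URLs and the form field names in the usage comment are hypothetical. Real sites often also expect hidden form fields such as CSRF tokens, which this sketch does not handle.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Build an opener that stores received cookies and sends them back."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login(opener, login_url, credentials):
    """POST the form fields in `credentials` to login_url; return the status code."""
    data = urllib.parse.urlencode(credentials).encode("utf-8")
    # Passing data= turns the request into a POST.
    with opener.open(login_url, data=data) as response:
        return response.status


def fetch(opener, url):
    """GET a page, sending along whatever cookies the login stored."""
    with opener.open(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical usage (URLs and field names are assumptions):
# opener, jar = make_session()
# login(opener, "https://www.example.com/login",
#       {"username": "user", "password": "pass"})
# html = fetch(opener, "https://www.example.com/private")
```

Because the cookie jar lives inside the opener, every later request made through the same opener carries the session cookie automatically.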




Answer 4:


We can do it using the selenium module, as shown below:

from selenium import webdriver
from selenium.webdriver.common.by import By

# initiate
my_browser = webdriver.Firefox()
my_browser.get("fill with url of the login page ")
try:
    my_browser.implicitly_wait(35)  # wait up to 35 seconds for elements to appear
    # pass the value of each field's name attribute from the page source
    username_field = my_browser.find_element(By.NAME, 'enter the value of the name attribute')
    password_field = my_browser.find_element(By.NAME, 'enter the value of the name attribute')
    username_field.send_keys("fill with User_name")
    password_field.send_keys("fill_with password")
    password_field.submit()  # submit it
finally:
    print('Look Into the Browser')


Source: https://stackoverflow.com/questions/9387500/python-how-do-i-parse-html-of-a-webpage-that-requires-being-logged-in
