Using BeautifulSoup where authentication is required

Submitted by 别说谁变了你拦得住时间么 on 2020-01-14 03:52:07

Question


I am scraping data from a LAN site using BeautifulSoup4 and Python requests for a company project. Since the site has a login interface, I am not authorized to access the data. The login interface is a pop-up that doesn't let me view the page source or inspect the page elements without logging in. The error I get is this:

Access Error: Unauthorized

Access to this document requires a User ID

This is a screenshot of the pop-up box (the blackened part is sensitive information). It contains no information about the HTML tags at all, so I cannot auto-login via Python.

I have tried requests_ntlm, selenium, python requests and even ParseHub, but none of them worked. I have been stuck at this stage for a month now! Please, any help would be appreciated.
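One way to narrow the problem down: a browser-native pop-up with no HTML behind it usually indicates HTTP-level authentication (Basic, Digest, or NTLM), and the server names the scheme it expects in the WWW-Authenticate response header. A minimal diagnostic sketch (the URL is a placeholder):

import requests

# An unauthenticated request should come back 401 with a
# WWW-Authenticate header naming the scheme the server wants
r = requests.get("http://intranet.example.com/report")
print(r.status_code)                      # typically 401
print(r.headers.get("WWW-Authenticate"))  # e.g. 'Basic realm="..."' or 'NTLM'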

Below is my initial code:

import requests
from requests_ntlm import HttpNtlmAuth
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Plain GET without credentials (requests needs the URL scheme)
r = requests.get("http://www.amazon.in")
print(r.content)

# GET with a browser-like User-Agent via urllib
req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

# GET with NTLM credentials
r = requests.get("http://www.amazon.in", auth=HttpNtlmAuth('user_name', 'passwd'))
print(r.content)

s_data = BeautifulSoup(r.content, "lxml")
print(s_data.prettify())

Error: Document Error: Unauthorized

Access Error: Unauthorized

Access to this document requires a User ID

This is the error I get when BeautifulSoup tries to access the data, even after I have manually logged into the site in the browser.


Answer 1:


Have you considered using mechanize?

import mechanize
import http.cookiejar as cookielib  # the module was named 'cookielib' on Python 2
from bs4 import BeautifulSoup

# Keep session cookies between requests so the login persists
cook = cookielib.CookieJar()
req = mechanize.Browser()
req.set_cookiejar(cook)

req.open("http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1")

# Fill in and submit the first form on the page; adjust the
# field names to match the site's actual login form
req.select_form(nr=0)
req.form['username'] = 'username'
req.form['password'] = 'password'
req.submit()

print(req.response().read())
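Since the end goal is scraping with BeautifulSoup, the logged-in response can go straight into the parser. A short sketch continuing from the code above (assuming the form submit succeeded):

# Parse the post-login page with BeautifulSoup
soup = BeautifulSoup(req.response().read(), "lxml")
print(soup.title)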

EDIT

If you come up against robots.txt issues and you have permission to circumvent them, then take a look at this answer for techniques: https://stackoverflow.com/questions/13303449/urllib2-httperror-http-error-403-forbidden
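With mechanize itself, both common workarounds from that answer look roughly like this (a sketch; only disable robots.txt handling if you are permitted to):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # skip robots.txt (only with permission)
# Present a browser-like User-Agent; some servers reject the default with 403
br.addheaders = [('User-Agent', 'Mozilla/5.0')]
br.open("http://www.cmegroup.com/trading/products/")
print(br.response().code)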




Answer 2:


If you are using BeautifulSoup and requests on Python 3.x, just use this:

from bs4 import BeautifulSoup
import requests

# auth=(user, password) sends HTTP Basic credentials
r = requests.get('URL', auth=('USER_NAME', 'PASSWORD'))
soup = BeautifulSoup(r.content, 'html.parser')  # name a parser explicitly to avoid a warning
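If the pop-up turns out to be NTLM (common on Windows intranets), the requests_ntlm handler from the question slots into the same pattern; NTLM usernames usually need the Windows domain prefix. A sketch with placeholder credentials:

import requests
from requests_ntlm import HttpNtlmAuth
from bs4 import BeautifulSoup

# NTLM auth via requests; 'DOMAIN\\USER_NAME' is a placeholder
r = requests.get('URL', auth=HttpNtlmAuth('DOMAIN\\USER_NAME', 'PASSWORD'))
soup = BeautifulSoup(r.content, 'html.parser')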


Source: https://stackoverflow.com/questions/46987241/using-beautifulsoup-where-authentication-is-required
