Question
I'm trying to get the HTML content from a GitLab URL.
But I am stuck at the GitLab sign-in page: I keep getting the HTML content of the sign-in page even after providing a username and password.
Code:
from bs4 import BeautifulSoup
import requests
username = "username"
password = "password"
url = "HTTP://gitlab.com/saikumar/webhooktslint"
result = requests.get(url, auth=(username, password)).content  # gets content from the site
soup = BeautifulSoup(result,'lxml')
for link in soup:
    print(link)
Output:
Getting HTML content of sign_in page.
Expected output:
Need to get the HTML content of the URL specified.
Answer 1:
I don't see a repo webhooktslint in your gitlab.com/saikumar page, so it is likely to be a private repository.
Looking at the python-gitlab CLI usage, make sure to properly set up your ~/.python-gitlab.cfg user configuration file, with a GitLab private token in it: you won't have to deal with credentials then.
The gitlab command from that Python package will make the HTTP requests for you, including the one that gets the raw data of a file.
But that same private token can help authenticate you when trying to do a GET of a private repo as you do in your code (if you are after the actual HTML page content).
The main point: to access a private repo, use a PAT (Personal Access Token) rather than your actual account password.
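As a rough illustration of the token approach, here is a minimal sketch that builds a request for the raw content of one file through the GitLab REST API (v4), authenticating with a `PRIVATE-TOKEN` header. Note that it targets the API rather than the rendered HTML page. The helper name `build_raw_file_request`, the file path, the branch name and the token value are all made up for the example; only the project path comes from the question.

```python
from urllib.parse import quote, urlencode


def build_raw_file_request(base_url, project_path, file_path, ref, token):
    """Return (url, headers) for fetching one raw file via the GitLab v4 API.

    Hypothetical helper for illustration. The project path and the file path
    are URL-encoded (including slashes) because they are used as single path
    segments in the API URL.
    """
    project_id = quote(project_path, safe="")
    encoded_file = quote(file_path, safe="")
    url = "{}/api/v4/projects/{}/repository/files/{}/raw?{}".format(
        base_url.rstrip("/"), project_id, encoded_file, urlencode({"ref": ref})
    )
    # A personal access token is sent in the PRIVATE-TOKEN header,
    # not as HTTP basic auth.
    headers = {"PRIVATE-TOKEN": token}
    return url, headers


# Example values: the file path, branch and token are placeholders.
url, headers = build_raw_file_request(
    "https://gitlab.com",
    "saikumar/webhooktslint",
    "README.md",
    "master",
    "<your-personal-access-token>",
)
print(url)
```

The resulting pair can be passed to `requests.get(url, headers=headers)`. If the token is valid and has read access to the project, the response body should be the file content instead of the sign-in page.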
Answer 2:
I have the same use case here. I would like to access a GitLab page (private repo) to get its HTML content, but it always redirects me to the sign-in page even though I have already authenticated (I followed this: https://gist.github.com/gpocentek/bd4c3fbf8a6ce226ebddc4aad6b46c0a)
Below is my code:
import urllib, re, sys, requests
from bs4 import BeautifulSoup
LOGIN_URL = 'https://gitlab.devtools.com//users/auth/ldapmain/callback'
session = requests.Session()
data = {'username': username,
        'password': password,
        'authenticity_token': token}
r = session.post(LOGIN_URL, data=data)
print(r.status_code)
url = "https://gitlab.devtools.com/Sandbox/testing/merge_requests/2"
html = session.get(url)
print(html.url)
Any idea on this? Am I missing anything?
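One thing worth checking in a flow like this is where `token` comes from. Below is a minimal sketch, assuming the sign-in page contains a hidden `<input name="authenticity_token" value="...">` field (the usual Rails CSRF field) and that a failed login ends up on a URL containing `/users/sign_in`. Both helper names are hypothetical, and the sample HTML is invented for the example.

```python
from html.parser import HTMLParser


class _TokenParser(HTMLParser):
    """Remembers the value of the first <input name="authenticity_token">."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # Only look at <input> tags, and stop after the first match.
        if tag != "input" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "authenticity_token":
            self.token = attributes.get("value")


def extract_authenticity_token(html):
    """Return the CSRF token found in the given HTML, or None if absent."""
    parser = _TokenParser()
    parser.feed(html)
    return parser.token


def landed_on_sign_in(final_url):
    """Heuristic: True if the final URL still points at the sign-in page."""
    return "/users/sign_in" in final_url


# Invented sample of what a sign-in form might contain.
sample = '<form><input type="hidden" name="authenticity_token" value="abc123" /></form>'
print(extract_authenticity_token(sample))
```

In a real run, the HTML would come from fetching the sign-in page with the same `requests.Session()` that later posts the credentials, so that the session cookie and the token match. Calling `landed_on_sign_in(html.url)` after the final `session.get(url)` is a quick way to tell whether the login actually succeeded.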
Source: https://stackoverflow.com/questions/55673936/extracting-html-content-from-gitlab-url