Extracting HTML content from a GitLab URL

Posted by 放荡痞女 on 2020-04-11 12:19:43

Question


I'm trying to get the HTML content of a GitLab URL.
But I'm stuck at the GitLab sign-in page: even after providing a username and password, I still get the HTML content of the sign-in page.

Code:

    from bs4 import BeautifulSoup
    import requests

    username = "username"
    password = "password"
    url = "HTTP://gitlab.com/saikumar/webhooktslint"
    # Get the content from the site, passing the credentials as HTTP basic auth
    result = requests.get(url, auth=(username, password)).content
    soup = BeautifulSoup(result, 'lxml')
    for link in soup:
        print(link)

Output:

   The HTML content of the sign-in page.

Expected output:

   The HTML content of the URL specified.

Answer 1:


I don't see a webhooktslint repo on your gitlab.com/saikumar page, so it is likely a private repository.

Looking at the python-gitlab CLI usage, make sure your ~/.python-gitlab.cfg user configuration file is set up properly, with a GitLab private token in it: you won't have to deal with credentials then.

The gitlab command from python-gitlab will make the HTTP requests for you, including fetching the raw data of a file.

That same private token can also help authenticate you when doing a GET on a private repo, as you do in your code (if you are after the actual HTML page content).

Main point: to access a private repo, use a PAT (Personal Access Token) rather than your actual account password.
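
As an illustration of the token-based approach, here is a minimal sketch that authenticates with a PAT against the GitLab REST API (v4) instead of scraping the web page. It uses only the standard library so it runs without extra packages. The function names and the "<your-token>" placeholder are hypothetical; the project path is the one from the question. It assumes the token has at least read access to the API.

```python
import json
import urllib.parse
import urllib.request

GITLAB_HOST = "https://gitlab.com"  # assumption: gitlab.com, as in the question


def project_api_url(project_path, host=GITLAB_HOST):
    """Build the REST API v4 URL for a project given its 'namespace/name' path."""
    # The API accepts the URL-encoded path in place of a numeric project ID.
    encoded = urllib.parse.quote(project_path, safe="")
    return f"{host}/api/v4/projects/{encoded}"


def build_request(project_path, token):
    """Create a GET request that carries the token in the PRIVATE-TOKEN header."""
    return urllib.request.Request(
        project_api_url(project_path),
        headers={"PRIVATE-TOKEN": token},
    )


def fetch_project(project_path, token):
    """Send the request and decode the JSON response (performs network I/O)."""
    with urllib.request.urlopen(build_request(project_path, token)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # "<your-token>" is a placeholder, not a real credential.
    req = build_request("saikumar/webhooktslint", "<your-token>")
    print(req.full_url)
```

With requests, the same URL and header can be passed as requests.get(url, headers={"PRIVATE-TOKEN": token}). Note that this returns JSON from the API rather than the rendered HTML page.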




Answer 2:


I have the same use case. I would like to fetch the HTML content of a GitLab page (private repo), but I am always redirected to the sign-in page even though I have already authenticated (I followed this gist: https://gist.github.com/gpocentek/bd4c3fbf8a6ce226ebddc4aad6b46c0a).

Below is my code:

    import urllib, re, sys, requests
    from bs4 import BeautifulSoup

    LOGIN_URL = 'https://gitlab.devtools.com//users/auth/ldapmain/callback'

    session = requests.Session()

    # username, password and token are assumed to be defined elsewhere
    # (they are not shown in the original post)
    data = {'username': username,
            'password': password,
            'authenticity_token': token}
    r = session.post(LOGIN_URL, data=data)

    print(r.status_code)
    url = "https://gitlab.devtools.com/Sandbox/testing/merge_requests/2"
    html = session.get(url)
    print(html.url)

Any idea on this? Am I missing anything?
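
One thing worth checking in a flow like this is where the authenticity_token comes from, since the snippet does not show it. Rails-based sign-in forms such as GitLab's carry it in a hidden input field. The sketch below shows one way to pull that field out of a page using only the standard library. The class and function names are hypothetical, and the sample HTML is made up for illustration, not copied from a real GitLab page.

```python
from html.parser import HTMLParser


class AuthenticityTokenParser(HTMLParser):
    """Collect the value of the first <input name="authenticity_token">."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (self.token is None and tag == "input"
                and attributes.get("name") == "authenticity_token"):
            self.token = attributes.get("value")


def extract_authenticity_token(html_text):
    """Return the hidden CSRF token from a sign-in page, or None if absent."""
    parser = AuthenticityTokenParser()
    parser.feed(html_text)
    return parser.token


# Made-up sample markup standing in for a real sign-in page.
SAMPLE_SIGN_IN_HTML = """
<form action="/users/sign_in" method="post">
  <input type="hidden" name="authenticity_token" value="abc123" />
  <input type="text" name="user[login]" />
</form>
"""

if __name__ == "__main__":
    print(extract_authenticity_token(SAMPLE_SIGN_IN_HTML))
```

In a real flow the HTML would presumably come from a GET on the sign-in page made with the same requests.Session that later sends the POST, since the token is tied to the session cookie. Whether that alone resolves the redirect on an LDAP-backed instance is not something this sketch can confirm.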



Source: https://stackoverflow.com/questions/55673936/extracting-html-content-from-gitlab-url
