beautifulsoup

Beautiful soup scrape - login credentials not working

Posted on 2020-05-13 14:34:07
Question: Trying to scrape a page with login credentials.

    payload = {
        'email': '*******@gmail.com',
        'password': '***'
    }
    urls = []
    login_url = 'https://www.spotrac.com/signin/'
    url = 'https://www.spotrac.com/nba/contracts/breakdown/2010/'
    webpage = requests.get(login_url, payload)
    content = webpage.content
    soup = BeautifulSoup(content)
    a = soup.find('table', {'class': 'datatable'})
    urls.append(a)

This is my first time scraping a page with credentials, and I can't seem to figure out how to properly enter
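A sketch of the usual fix, assuming the Spotrac sign-in form accepts a plain POST with these two field names (the live form may also require hidden CSRF fields): send the credentials with requests.post on a Session so the login cookies carry over to the data page; requests.get only attaches the payload as query parameters and never logs in.

    import requests
    from bs4 import BeautifulSoup

    payload = {'email': '*******@gmail.com', 'password': '***'}
    login_url = 'https://www.spotrac.com/signin/'
    url = 'https://www.spotrac.com/nba/contracts/breakdown/2010/'

    with requests.Session() as session:
        # POST the credentials; a plain GET never submits the login form
        session.post(login_url, data=payload)
        # the session re-sends the login cookies on this request
        page = session.get(url)

    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find('table', {'class': 'datatable'})

If the table still comes back empty, the browser's network tab will show the exact field names and any hidden inputs the sign-in POST needs.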

Insert HTML into an element with BeautifulSoup

Posted on 2020-05-13 12:23:11
Question: When I try to insert the following HTML into an element:

    <div class="frontpageclass"><h3 id="feature_title">The Title</h3>... </div>

bs4 is replacing it like this:

    <div class="frontpageclass"><h3 id="feature_title">The Title </h3>... <div></div>

I am using .string and it is still messing up the format.

    with open(html_frontpage) as fp:
        soup = BeautifulSoup(fp, "html.parser")
    found_data = soup.find(class_='front-page__feature-image')
    found_data.string = databasedata

If I try to use found_data
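A sketch of the usual remedy: assigning to .string inserts the fragment as text, which is why the tags get escaped; parsing the fragment into its own soup and appending the parsed nodes keeps the markup intact. The document and fragment below are placeholders standing in for the question's html_frontpage file and databasedata value.

    from bs4 import BeautifulSoup

    # placeholders for the question's html_frontpage file and databasedata value
    page = '<div class="front-page__feature-image">old</div>'
    databasedata = ('<div class="frontpageclass">'
                    '<h3 id="feature_title">The Title</h3></div>')

    soup = BeautifulSoup(page, 'html.parser')
    found_data = soup.find(class_='front-page__feature-image')
    found_data.clear()

    # parse the fragment so it lands in the tree as real tags, not escaped text
    fragment = BeautifulSoup(databasedata, 'html.parser')
    for node in list(fragment.contents):
        found_data.append(node)

    print(soup)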

How to get favicon by using beautiful soup and python

Posted on 2020-05-13 06:23:07
Question: I wrote some simple code just for learning, but it doesn't work for any site. Here is the code:

    import urllib2, re
    from BeautifulSoup import BeautifulSoup as Soup

    class Founder:
        def Find_all_links(self, url):
            page_source = urllib2.urlopen(url)
            a = page_source.read()
            soup = Soup(a)
            a = soup.findAll(href=re.compile(r'/.a\w+'))
            return a

        def Find_shortcut_icon(self, url):
            a = self.Find_all_links(url)
            b = ''
            for i in a:
                strre = re.compile('shortcut icon', re.IGNORECASE)
                m = strre.search(str(i))
                if m
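A sketch of a more direct approach, written for Python 3 and bs4 4.x rather than the question's urllib2 / BeautifulSoup 3, and assuming the icon is declared in a <link rel="..."> tag; many sites additionally serve a default /favicon.ico at the root:

    import re
    from urllib.parse import urljoin
    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    def find_favicon(url):
        """Return the absolute URL of a page's favicon, if one is declared."""
        soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
        # matches rel="icon" as well as rel="shortcut icon", case-insensitively
        link = soup.find('link', rel=re.compile('icon', re.I))
        if link and link.get('href'):
            return urljoin(url, link['href'])
        # fall back to the conventional location at the site root
        return urljoin(url, '/favicon.ico')

    print(find_favicon('https://www.python.org/'))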

How to insert unescaped html fragment in Beautiful Soup 4

Posted on 2020-05-12 04:57:28
Question: I have to parse some nasty government-created HTML (http://www.spokanecounty.org/detentionservices/inmateroster/detail2.aspx?sysid=84060), and to ease my pain I would like to insert some HTML fragments into the document to wrap some content into more easily digested chunks. BS4, however, escapes the HTML string fragment I'm trying to insert (<div class="case">) and turns it into escaped text: &lt;div class="case"&gt;. The relevant HTML I'm parsing is this:

    <div style='float:left; width:100%;border-top
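A sketch of one way around the escaping: build the wrapper as a real tag with new_tag and wrap() the target element instead of inserting the markup as a raw string. The find call below is a placeholder guess, since only the opening of the page's HTML is quoted.

    import requests
    from bs4 import BeautifulSoup

    url = ('http://www.spokanecounty.org/detentionservices/'
           'inmateroster/detail2.aspx?sysid=84060')
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    # placeholder guess at the chunk to wrap: the float:left block quoted above
    target = soup.find('div', style=lambda s: s and 'float:left' in s)

    # a tag built with new_tag is inserted as markup, not escaped like a string
    wrapper = soup.new_tag('div')
    wrapper['class'] = 'case'
    if target is not None:
        target.wrap(wrapper)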

Web scraping through pagination list

Posted on 2020-05-09 17:14:18
Question: I would like to search through a list (see the figure below) that is very long, over 100 pages. The list contains football scouts, and I am only interested in those with specific attributes. The attributes can be seen when hovering the mouse over the small boxes on the right side (yellow highlights). The script should extract the scouts whose three attributes all equal 20 (Discipline 20, Motivation 20, Potential assessment 20). The chosen scouts should be inserted into an Excel sheet (see second figure below)
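A generic sketch of the paginate-filter-export loop; the URL pattern and CSS selectors are hypothetical placeholders, since the question only shows the site in figures, and the hover text is assumed to live in title attributes:

    import requests
    from bs4 import BeautifulSoup
    from openpyxl import Workbook

    BASE = 'https://example.com/scouts?page={}'  # hypothetical URL pattern
    WANTED = {'Discipline 20', 'Motivation 20', 'Potential assessment 20'}

    wb = Workbook()
    ws = wb.active
    ws.append(['Name', 'Attributes'])

    for page in range(1, 101):  # the list spans 100+ pages
        html = requests.get(BASE.format(page)).content
        soup = BeautifulSoup(html, 'html.parser')
        for row in soup.select('tr.scout'):  # hypothetical row selector
            # the hover text is assumed to sit in the boxes' title attributes
            attrs = {box['title'] for box in row.select('.attribute-box[title]')}
            if WANTED <= attrs:  # all three wanted attributes are present
                name = row.select_one('.name').get_text(strip=True)
                ws.append([name, ', '.join(sorted(WANTED))])

    wb.save('scouts.xlsx')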

'ascii' codec can't encode character: ordinal not in range(128)

Posted on 2020-05-09 16:59:10
Question: I'm scraping some webpages using Selenium and BeautifulSoup. I'm iterating through a bunch of links, grabbing info, and then dumping it into a JSON:

    for event in events:
        case = {'Artist': item['Artist'], 'Date': item['Date'],
                'Time': item['Time'], 'Venue': item['Venue'],
                'Address': item['Address'],
                'Coordinates': item['Coordinates']}
        item[event] = case

    with open("testScrape.json", "w") as writeJSON:
        json.dump(item, writeJSON, ensure_ascii=False)

When I get to this link: https://www
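A likely fix, assuming Python 3: with ensure_ascii=False, json.dump writes the non-ASCII characters themselves instead of \uXXXX escapes, so the file must be opened with an encoding that can represent them:

    import json

    item = {'Artist': 'Sigur Rós'}  # stand-in for the dict built above

    # open the file as UTF-8; with ensure_ascii=False the dump contains
    # raw non-ASCII characters, and a default-encoded handle raises
    # "'ascii' codec can't encode character"
    with open("testScrape.json", "w", encoding="utf-8") as writeJSON:
        json.dump(item, writeJSON, ensure_ascii=False)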

Python crawler 07 | With BeautifulSoup, Mom no longer has to worry about my regular expressions

Posted on 2020-05-08 05:07:00
Last time we built your first crawler, scraping Dangdang's Top 500 five-star-rated books. Some of you felt that pulling the information out with regular expressions was a damn hassle. Isn't there some other way to filter out the content we want more conveniently? Emmmm... you know what, there actually is. There's an efficient HTML parsing library, and its name is BeautifulSoup. It is a Python library that can extract data from HTML or XML files. So how do we play with it? ... What follows is the correct posture for learning Python. First we need to install the library:

    pip install beautifulsoup4

beautifulsoup supports different parsers, for example for parsing HTML, for parsing XML, and for parsing HTML5. In general, the one we use most is the lxml parser. Let's start with an example so you can get a feel for some of beautifulsoup's common methods. They're seriously awesome. Say we have a piece of HTML code like this:

    html_doc = """
    <html>
    <head>
    <title>学习python的正确姿势</title>
    </head>
    <body>
    <p class="title"><b>小帅b的故事</b></p>
    <p class="story">有一天，小帅b想给大家讲两个笑话 <a href="http:/
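A short sketch of the kind of calls the post goes on to demonstrate, using the lxml parser it recommends (standard bs4 methods; the html_doc here is a small stand-in, since the post's snippet is cut off above):

    from bs4 import BeautifulSoup

    # small stand-in for the post's html_doc snippet;
    # the lxml parser requires: pip install lxml
    html_doc = """
    <html><head><title>学习python的正确姿势</title></head>
    <body>
    <p class="title"><b>小帅b的故事</b></p>
    <p class="story">有一天，小帅b想给大家讲两个笑话
      <a href="https://example.com/joke1" id="link1">笑话一</a>
    </p>
    </body></html>
    """

    soup = BeautifulSoup(html_doc, 'lxml')

    print(soup.title.string)               # text inside <title>
    print(soup.find('p', class_='title'))  # first <p class="title"> tag
    for a in soup.find_all('a'):           # every link in the document
        print(a.get('href'))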

Data cleaning: an introduction to cleancc

Posted on 2020-05-08 03:34:45
Data cleaning: cleancc. cleancc can quickly clean data content.

Project address

Usage:

    pip install cleancc
    import cleancc

There are five callable functions:

1. punct:
   Removes punctuation and lowercases all letters.
   :param pop_list: the list to process
   :param lower: whether to convert to lowercase, default yes
   :return all_comment: the processed result, as a string

2. statistics:
   Word-frequency statistics.
   :param pop_list: the list to process
   :param symbol: whether to strip punctuation, default yes
   :param lower: whether to convert to lowercase, default yes
   :return wordCount_dict: the counts, as a dict

3. stop_words:
   Removes stop words from the word-frequency counts.
   :param statis: whether to apply frequency cleaning
   :param pop_list: the list to process
   :param symbol: whether to strip punctuation, default yes
   :param lower: whether to convert to lowercase, default yes
   :param wordCount_dict: the word-frequency result, as a dict
   :return
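A usage sketch inferred purely from the parameter lists above; the actual cleancc API may differ, so treat these calls as illustrative:

    import cleancc

    comments = ['Great product!!!', 'GREAT product, would buy again.']

    # punct: strip punctuation and lowercase, returning a single string
    text = cleancc.punct(comments, lower=True)

    # statistics: word-frequency counts, returned as a dict
    counts = cleancc.statistics(comments, symbol=True, lower=True)
    print(text, counts)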