beautifulsoup

Write a small Python crawler that pushes a bedtime story to your girlfriend on a daily schedule

丶灬走出姿态 submitted on 2021-02-02 13:52:34
Overview: This article uses a simple Python crawler, email sending, and a scheduled task to push a bedtime story at a fixed time every day, with the steps described in detail. Recently a certain cutie asked me to tell her a little story every night before bed once I finish work. I figured the internet should have all kinds of resources and that little stories are easy to find, but the quantity is small, the formats are inconsistent, and extraction is difficult. Then it occurred to me that bedtime stories written for children would probably work just as well, so I decided to source the material from children's bedtime stories. After searching I found a site well suited to extracting bedtime stories: tom61.com/ertongwenxue/ There are 700 stories in total, so at one per day the supply is sufficient, and the HTML format is fairly uniform, so that's the one! View the page source and search (Ctrl+F) for the keyword 幸福王国 to locate the relevant markup: each story link is contained in the href attribute of an a tag inside a dl tag, e.g. /ertongwenxue/shuiqiangushi/2018-02-25/106432.html, which when clicked resolves to the full URL under tom61.com/ertongwenxue/ The next step is to extract that link: 1. Simulate a browser visiting the page, requesting it with the requests library. Implementation:

def getHTMLText(url, headers):
    try:
        r = requests.get(url, headers=headers, timeout=30)
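The excerpt cuts the helper off mid-function. Below is a minimal sketch of how the request helper and the link extraction described above might continue; the listing URL, the error handling, and the encoding handling are assumptions of mine, not the original article's code:

```python
import requests
from bs4 import BeautifulSoup

def getHTMLText(url, headers):
    """Fetch a page and return its text, or an empty string on failure."""
    try:
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding  # the site serves Chinese text; let requests guess the codec
        return r.text
    except requests.RequestException:
        return ""

def get_story_links(html):
    """Collect story paths from the href of <a> tags inside <dl> blocks, as described above."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for dl in soup.find_all("dl") for a in dl.find_all("a", href=True)]

if __name__ == "__main__":
    headers = {"User-Agent": "Mozilla/5.0"}  # pretend to be a browser
    # Assumed listing URL, pieced together from the story path shown in the excerpt.
    html = getHTMLText("http://www.tom61.com/ertongwenxue/shuiqiangushi/", headers)
    print(get_story_links(html)[:5])
```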

Python learning: the principles of Python web crawlers

ε祈祈猫儿з submitted on 2021-02-02 05:45:32
Today we will explain in detail how Python crawlers work: what a Python crawler is, what a crawler's basic workflow looks like, and so on. Hopefully this helps those of you who are currently learning Python crawling!
Preface: Simply put, the internet is a big net made up of sites and network devices. We visit a site through a browser, the site returns HTML, JS, and CSS code to the browser, and the browser parses and renders that code into the colourful pages in front of our eyes.
1. What is a crawler? If we compare the internet to a big spider web, the data lives at the web's nodes, and a crawler is a little spider that moves along the web grabbing its prey (the data). A crawler is a program that sends requests to a website, fetches the resources, and then analyses them and extracts the useful data. At the technical level, it uses a program to imitate a browser requesting a site, pulls the HTML code / JSON data / binary data (images, video) the site returns down to the local machine, and then extracts the data it needs and stores it for later use.
2. The basic workflow of a crawler. The ways a user obtains data from the network: Method 1: the browser submits a request ---> downloads the page code ---> parses it into a page. Method 2: simulate a browser sending the request (fetch the page code) -> extract the useful data -> store it in a database or a file. A crawler does method 2. (1) Send a request: use an HTTP library to send a request to the target site, i.e. send a Request. A Request contains request headers, a request body, and so on. A limitation of the Request module: it cannot execute JS or CSS code. (2) Get the response content: if the server responds normally, you get a Response
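To make the two-step flow above concrete, here is a minimal sketch of "method 2": send a request, read the response, parse it, and store the result. The URL and the fields being extracted are placeholders, not taken from the article:

```python
import requests
from bs4 import BeautifulSoup

# 1. Send the request: an HTTP library plays the role of the browser.
url = "https://example.com"                # placeholder target site
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=10)

# 2. Get the response content: HTML text here; it could also be JSON or binary data.
html = response.text

# 3. Parse and extract the useful data (here: every link's text and target).
soup = BeautifulSoup(html, "html.parser")
rows = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]

# 4. Store it, e.g. in a file or a database.
with open("links.csv", "w", encoding="utf-8") as f:
    for text, href in rows:
        f.write(f"{text},{href}\n")
```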

Python web scraping HTML with same class

ε祈祈猫儿з submitted on 2021-01-29 22:37:09
Question: I would like to ask how I can extract the event's fees from this website using Python libraries (BeautifulSoup) for web scraping. However, the event's fee shares the same class with other properties. Are there any suggestions for extracting only the fees? I have tried find_next, find_next_sibling and find_parent but still no luck. Below is the raw HTML where the price's class is located: <div class="eds-event-card-content__sub eds-text-bm eds-text-color--ui-600 eds-l-mar
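One common workaround when several fields share a class is to select them all and then filter on the text itself. Below is a minimal sketch of that idea; the HTML is a cut-down stand-in for the card in the question, and the assumption that a fee starts with a currency symbol or reads "Free" is mine:

```python
from bs4 import BeautifulSoup

# Stand-in for the event card: several fields share the same class.
html = """
<div class="eds-event-card-content__sub">Sat, Feb 6, 10:00 AM</div>
<div class="eds-event-card-content__sub">Singapore</div>
<div class="eds-event-card-content__sub">$15.00</div>
"""

soup = BeautifulSoup(html, "html.parser")
candidates = soup.find_all("div", class_="eds-event-card-content__sub")

# Keep only the entries that look like a price (assumed heuristic).
fees = [d.get_text(strip=True) for d in candidates
        if d.get_text(strip=True).startswith("$") or "Free" in d.get_text()]
print(fees)  # ['$15.00']
```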

Web scraping using Python and Beautiful soup: error “'page' is not defined”

笑着哭i submitted on 2021-01-29 22:11:25
Question: From a betting site, I want to collect the betting rates. After inspecting the page, I noticed that these rates are contained in an eventprice class. Following the explanation from here, I wrote this code in Python, using the BeautifulSoup module:

from bs4 import BeautifulSoup
import urllib.request
import re

url = "http://sports.williamhill.com/bet/fr-fr"
try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile(
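The "'page' is not defined" error usually means urlopen raised an exception, the bare except only printed a message, and execution then continued without page ever being assigned. A minimal sketch of a safer structure; the eventprice lookup at the end is a guess at what the truncated re.compile call was building towards:

```python
from bs4 import BeautifulSoup
import urllib.request
import urllib.error
import re
import sys

url = "http://sports.williamhill.com/bet/fr-fr"

try:
    page = urllib.request.urlopen(url)
except urllib.error.URLError as exc:
    print(f"An error occurred: {exc}")
    sys.exit(1)  # stop here instead of falling through with `page` undefined

soup = BeautifulSoup(page, "html.parser")

# Match any tag whose class contains "eventprice" (assumed from the question).
regex = re.compile("eventprice")
prices = [tag.get_text(strip=True) for tag in soup.find_all(class_=regex)]
print(prices)
```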

I cannot autologin to pastebin using requests + BeautifulSoup

与世无争的帅哥 submitted on 2021-01-29 20:53:39
Question: I am trying to auto-login to a pastebin account using Python, but I'm failing and I don't know why. I copied the request headers exactly and double-checked... but I am still greeted with an HTTP 400 code. Can somebody help me? This is my code:

import requests
from bs4 import BeautifulSoup
import subprocess
import os
import sys
from requests import Session

# the actual program
page = requests.get("https://pastebin.com/99qQTecB")
parse = BeautifulSoup(page.content, 'html.parser')
string = parse.find(
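An HTTP 400 on a copied login request is often a missing or stale anti-CSRF token rather than a header problem. The usual requests pattern is to open a Session, GET the login page, read the hidden token with BeautifulSoup, and POST it back together with the credentials. A minimal sketch of that pattern only; the login URL and all form field names are assumptions and should be copied from the real form in the browser's dev tools:

```python
import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://pastebin.com/login"  # assumption: the page that serves the login form

with requests.Session() as session:       # a Session keeps cookies across requests
    session.headers.update({"User-Agent": "Mozilla/5.0"})

    # 1. Load the login page and read a hidden anti-CSRF field from the form.
    login_page = session.get(LOGIN_URL)
    soup = BeautifulSoup(login_page.content, "html.parser")
    token_input = soup.find("input", {"type": "hidden"})  # inspect the form for the real field
    token = token_input.get("value", "") if token_input else ""

    # 2. Post the form back. All field names below are placeholders; copy the real
    #    name= attributes of the form inputs from the browser's dev tools.
    payload = {
        "csrf_token": token,        # placeholder name
        "username": "my_user",      # placeholder name
        "password": "my_password",  # placeholder name
    }
    resp = session.post(LOGIN_URL, data=payload)
    print(resp.status_code)

    # 3. Further session.get(...) calls now carry the login cookies.
```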

How can I scrape code inside div with BeautifulSoup?

烂漫一生 submitted on 2021-01-29 19:39:25
Question: I am having problems scraping values from inside a div tag (not the content between the tags) with BeautifulSoup and Python. Below is the HTML I want to scrape (the data-friendscount and data-followerscount values): <div data-profileuserid="285904056" data-friendscount="100" data-followerscount="7102" data-followingscount="25" data-arefriends="false" class="hidden ng-isolate-scope"></div> Answer 1:

toc = requests.get(f'roblox.com/users/75790059/profile')
soup = BeautifulSoup(toc.content, 'html.parser')
divs = soup.find
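A completed sketch of the answer's approach: find the div by one of its data-* attributes and then read the other attributes off the tag like dictionary keys. The https:// scheme is added because requests requires one, and whether the attributes are present in the static HTML (rather than filled in by JavaScript) is an assumption:

```python
import requests
from bs4 import BeautifulSoup

# requests needs an explicit scheme, so https:// is added to the URL from the answer.
toc = requests.get("https://www.roblox.com/users/75790059/profile")
soup = BeautifulSoup(toc.content, "html.parser")

# Locate the div by one data-* attribute, then read the others like dictionary keys.
div = soup.find("div", attrs={"data-profileuserid": True})
if div is not None:
    print(div["data-friendscount"], div["data-followerscount"])
else:
    # If the counts are injected by JavaScript, the static HTML will not contain
    # them and a browser-driven tool such as Selenium would be needed instead.
    print("div not found in the static HTML")
```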

soup.findAll returning empty list

梦想的初衷 submitted on 2021-01-29 19:10:15
Question: I am trying to scrape with BeautifulSoup and am getting an empty list when I call findAll:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url='https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/SearchDisplayView?catalogId=10123&langId=44&storeId=10151&krypto=70KutR16JmLgr7Ka%2F385RFXrzDpOkSqx%2FRC3DnlU09%2BYcw0pR5cfIfC0kOlQywiD%2BTEe7ppq8ENXglbpqA8sDUtif1h3ZjrEoQkV29%2B90iqljHi2gm2T%2BDZHH2%2FCNeKB%2BkVglbz%2BNx1bKsSfE5L6SVtckHxg%2FM%2F
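When find_all comes back empty, it helps to check what the server actually returned before blaming the selector. A minimal diagnostic sketch; the product selector is a placeholder since the excerpt cuts off before the original findAll call, and the long query string is abbreviated here:

```python
from urllib.request import Request, urlopen as uReq
from bs4 import BeautifulSoup as soup

# The full query string is long and is shortened here; use the real URL from the question.
my_url = "https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/SearchDisplayView?catalogId=10123&langId=44&storeId=10151"

# Send a browser-like User-Agent; some sites return a stripped-down page without one.
req = Request(my_url, headers={"User-Agent": "Mozilla/5.0"})
raw_html = uReq(req).read()
page = soup(raw_html, "html.parser")

# Diagnose the empty result: check what actually came back.
print(len(raw_html))   # a tiny body often means a redirect, error, or consent page
print(page.title)      # is this even the page you expected?

items = page.find_all("div", {"class": "product"})  # placeholder selector, not from the question
print(len(items))      # still 0 usually means the data is rendered by JavaScript
```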

Not able to do web scraping using BeautifulSoup and requests

夙愿已清 submitted on 2021-01-29 19:02:31
Question: I am trying to scrape the values of the first two sections, i.e. the 1*2 and DOUBLECHANCE sections, using bs4 and requests from this website https://web.bet9ja.com/Sport/SubEventDetail?SubEventID=76512106 The code I have written is:

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://web.bet9ja.com/Sport/SubEventDetail?SubEventID=76512106')
soup = bs.BeautifulSoup(source, 'lxml')
for div in soup.find_all('div', class_='SEItem ng-scope'):
    print(div.text)

when I run I am
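The ng-scope class suggests the odds are rendered client-side by an Angular app, in which case the static HTML that urllib downloads may not contain them at all, so the loop prints nothing. A minimal sketch of a browser-driven alternative with Selenium, offered as one possible approach rather than a verified fix for this site:

```python
import time

import bs4 as bs
from selenium import webdriver

url = "https://web.bet9ja.com/Sport/SubEventDetail?SubEventID=76512106"

driver = webdriver.Chrome()  # assumes Chrome and a matching chromedriver are available
try:
    driver.get(url)
    time.sleep(5)            # crude wait for the Angular app to render; explicit waits are better
    soup = bs.BeautifulSoup(driver.page_source, "lxml")
    for div in soup.find_all("div", class_="SEItem ng-scope"):
        print(div.text.strip())
finally:
    driver.quit()
```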