beautifulsoup

How to scrape Instagram with BeautifulSoup

泄露秘密 Submitted on 2021-01-16 07:52:55

Question: I want to scrape pictures from a public Instagram account. I'm pretty familiar with bs4, so I started with that. Using the element inspector in Chrome, I noted that the pictures are in an unordered list and each li has class 'photo', so I figured, what the hell -- can't be that hard to scrape with findAll, right? Wrong: it doesn't return anything (code below), and I soon noticed that the code shown in the element inspector and the code I drew from requests were not the same, AKA no unordered list in the
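The question's own code is not included in this excerpt; the sketch below only reconstructs the kind of attempt it describes (the li class 'photo' comes from the question, the profile URL is a placeholder). It illustrates the mismatch the asker observed: the static HTML returned by requests is not the JavaScript-rendered DOM that the Chrome inspector shows, so findAll comes back empty.

```
import requests
from bs4 import BeautifulSoup

# Hypothetical public profile URL -- a placeholder, not from the original post
url = "https://www.instagram.com/some_public_account/"
html = requests.get(url).text

soup = BeautifulSoup(html, "html.parser")
photos = soup.find_all("li", class_="photo")
print(photos)  # [] -- the li.photo elements only exist after JavaScript runs

# Quick sanity check: the class seen in the inspector is absent from the raw HTML
print("photo" in html)
```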

asyncio and aiohttp

最后都变了- Submitted on 2021-01-14 00:34:51

asyncio official docs: https://docs.python.org/zh-cn/3/library/asyncio-task.html. The following is pseudocode:

```
import aiohttp
import asyncio
from bs4 import BeautifulSoup
import pandas as pd

# Store the data in li = [] or in a database
li = []

# Fetch a page
async def fetch(url, session):
    async with session.get(url) as response:
        return await response.text()

# Parse the page
async def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    # Get the bestsellers on the page
    book_list = soup.find('ul', class_='book_list')('li')
    for book in book_list:
        info = book.find_all('div')
        # Get each bestseller's rank, name, comment count, author and publisher
        rank = info[0].text[0:-1]
        name = info[2].text
        comments = info[3].text
```
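The excerpt cuts off before showing how these coroutines are driven. A minimal sketch of one common way to run them, assuming the fetch and parse coroutines above; the entry point and the URL list are illustrative, not from the original post.

```
async def main():
    urls = ['http://example.com/page1', 'http://example.com/page2']  # hypothetical
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(url, session) for url in urls))
    for html in pages:
        await parse(html)

asyncio.run(main())
```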

Python crawler for scraping a novel, no time to explain.

北城以北 Submitted on 2021-01-13 07:39:24

There is no recent txt download online, so I found my own way to download it to read; you can adapt the code to different page tags. I'm still a beginner and it's not smart, I'll keep studying it slowly. bs4 needs to be installed first: pip install BeautifulSoup4

```
# coding: utf-8
import urllib.request
from bs4 import BeautifulSoup
import time
import re

def get_html(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    # print(bytes.decode(html))
    return html

'''
page = 'https://www.xuehong.cc/book/36273/'
p1 = BeautifulSoup(get_html(page).decode('utf-8'), 'html.parser')
p2 = []
for p in p1.find_all('a'):
    p2.append(p['href'])
print(p2)
'''

p3 = ['/book/36273/31737154.html', '/book/36273/31737155.html', '
```
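The excerpt ends partway through the list of chapter URLs. A rough sketch of how the chapter pages in p3 might then be fetched and saved with the get_html helper above; the chapter title/body tags and the output filename are guesses for illustration, not taken from the original post, since the real selectors depend on the site's markup.

```
base = 'https://www.xuehong.cc'

for chapter in p3:
    soup = BeautifulSoup(get_html(base + chapter).decode('utf-8'), 'html.parser')
    title = soup.find('h1').get_text(strip=True)           # assumed tag
    body = soup.find('div', id='content').get_text('\n')   # assumed tag/id
    with open('novel.txt', 'a', encoding='utf-8') as f:
        f.write(title + '\n' + body + '\n\n')
    time.sleep(1)  # be polite to the server
```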

Scraping a given Bilibili video's danmaku with Python and making a word cloud

匆匆过客 Submitted on 2021-01-12 20:01:30

First, a look at the final result. The overall approach is: 1. crawl the page that carries the danmaku data; 2. process what was crawled, extract the danmaku text we need, and write it to a text file; 3. turn that text into the desired image with a word-cloud library. The libraries needed:

```
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import jieba
from wordcloud import WordCloud
from scipy.misc import imread
import matplotlib.pyplot as plt
```

First, crawl the information we want. PS: all of a Bilibili video's danmaku are stored at http://comment.bilibili.com/122512779.xml, where the number shown in red, 122512779, is that video's cid; you can get it on the video page by viewing the page source and searching (Ctrl+F) for "cid", taking the first nine-digit cid that appears.

```
url = 'http://comment.bilibili.com/.xml'  # the target URL
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari
```
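The excerpt is cut off before step 2. As a hedged sketch of that step: the endpoint above returns XML in which each danmaku is typically wrapped in a <d> element, so one way to extract the text could look like the following; the <d>-tag assumption and the output filename are mine, not from the original post.

```
cid = '122512779'  # taken from the example URL in the post
resp = requests.get('http://comment.bilibili.com/' + cid + '.xml')
resp.encoding = 'utf-8'

soup = BeautifulSoup(resp.text, 'html.parser')
danmaku = [d.get_text() for d in soup.find_all('d')]  # assumes one <d> per comment

with open('danmaku.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(danmaku))
```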

Python crawler: scrape the pictures of the young ladies you want

旧城冷巷雨未停 Submitted on 2021-01-10 08:56:50

I. Preparation: 1. the source URL; 2. inspecting the HTML shows the pages follow a regular pagination pattern, and the full-size image has class pic-large. II. Code:

```
import requests
import os
from bs4 import BeautifulSoup

url = 'http://www.win4000.com/wallpaper_detail_157712.html'
imgmkdir = 'D://Download//ghost_1//'


# Build the list of page URLs
def getUrlList():
    imgUrlList = []
    for i in range(0, 10):
        imgUrl = ''
        url_split = url.split('.html')
        if not i == 0:
            imgUrl += url_split[0] + '_' + str(i) + '.html'
            # print(imgUrl)
            imgUrlList.append(imgUrl)
    return imgUrlList


# Download an image
def downImg(imgUrl):
    try:
        if not os.path.exists(imgmkdir):
```
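The downImg function is cut off in this excerpt. A minimal sketch of how the rest might look, assuming the pic-large class mentioned in the preparation notes; the attribute name and filename scheme are guesses for illustration, not the original author's code.

```
def downImgSketch(imgUrl):
    # Create the target directory if it does not exist yet
    if not os.path.exists(imgmkdir):
        os.makedirs(imgmkdir)
    html = requests.get(imgUrl).text
    soup = BeautifulSoup(html, 'html.parser')
    img = soup.find('img', class_='pic-large')   # class from the notes above
    src = img['src']                             # assumes the image URL is in src
    data = requests.get(src).content
    with open(imgmkdir + src.split('/')[-1], 'wb') as f:
        f.write(data)

for pageUrl in getUrlList():
    downImgSketch(pageUrl)
```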

Limited number of scraped data?

时光总嘲笑我的痴心妄想 Submitted on 2021-01-07 02:51:06

Question: I am scraping a website and everything seems to work fine from today's news back to news published in 2015/2016. After those years, I am not able to scrape news. Could you please tell me if anything has changed? I should get 672 pages of titles and snippets from this page: https://catania.liveuniversity.it/attualita/ but I get approx. 158. The code that I am using is:

```
import bs4, requests
import pandas as pd
import re

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11
```
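The question's pagination loop is not included in the excerpt. For context, a minimal sketch of the usual pattern for walking a WordPress-style paginated archive like the one linked above, stopping when a page no longer returns articles; the /page/N/ URL scheme and the CSS selector are assumptions for illustration, not the asker's actual code.

```
import bs4, requests

headers = {'User-Agent': 'Mozilla/5.0'}
base = 'https://catania.liveuniversity.it/attualita/'

titles = []
page = 1
while True:
    url = base if page == 1 else base + 'page/' + str(page) + '/'
    r = requests.get(url, headers=headers)
    if r.status_code != 200:
        break                      # no more pages (or the site is blocking us)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    posts = soup.select('h2 a')    # assumed selector for article title links
    if not posts:
        break
    titles.extend(a.get_text(strip=True) for a in posts)
    page += 1

print(len(titles))
```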

A roundup of ways to download images from the web in Python, with implementations

懵懂的女人 Submitted on 2021-01-06 15:31:03

> This article covers ways to download images from the web in Python, including downloading directly from an image URL, downloading after parsing the HTML with re/BeautifulSoup, and handling dynamically generated pages.

### Single/batch download via pic_url

When the image URLs are known, e.g. http://xyz.com/series-*(1,2..N).jpg for N images in total, the links follow a fairly fixed pattern, so with a simple loop you can write each image out in binary form directly via `f.write(requests.get(url).content)`.

```
import os
import requests


def download(file_path, picture_url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE",
    }
    r = requests.get(picture_url, headers=headers)
    with open(file_path, "wb") as f:
        f.write(r.content)
```
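The prose above describes looping over a numbered series of URLs, while the download helper only saves a single image. A small driver loop for the series case described in the text, with the base URL and the count as placeholder assumptions:

```
# Hypothetical series http://xyz.com/series-1.jpg ... series-N.jpg (N = 5 here)
for i in range(1, 6):
    picture_url = "http://xyz.com/series-" + str(i) + ".jpg"
    download("series-" + str(i) + ".jpg", picture_url)
```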

Scraping Kugou Music with Python and beautifulsoup4

你。 Submitted on 2021-01-05 23:45:22

Disclaimer: this article is for technical exchange only; please do not use it for anything else. I often listen to music online, but on many sites a lot of it is paid download, and since I know a bit of crawling I wrote this in my spare time. It still worked as of the end of April, and it downloads to the current directory; you only need to install the bs4 library. Installation: pip install beautifulsoup4. The full code is below; double-click it and it runs directly.

```
from bs4 import BeautifulSoup
import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'
}
url = 'https://songsearch.kugou.com/song_search_v2?&page=1&pagesize=30&userid=-1&clientver=&platform=WebFilter&tag=em&filter=2&iscorrection=1&privilege_filter=0&_=1555124510574'
# To scrape something else, just change this JSON data URL
r = requests.get(url, headers=headers)
soup
```
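The excerpt stops right where the response is about to be processed. Since the comment above calls the URL a JSON data address, a plausible next step is to read the response as JSON rather than HTML; the field names in this sketch ('data', 'lists', 'SongName') are assumptions about that endpoint, not something shown in the excerpt.

```
results = r.json()
for song in results.get('data', {}).get('lists', []):   # assumed field names
    print(song.get('SongName'))
```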

BeautifulSoup removing tags

試著忘記壹切 Submitted on 2021-01-05 08:57:35

Question: I'm trying to remove the style tags and their contents from the source, but it's not working; there are no errors, it simply doesn't decompose. This is what I have:

```
source = BeautifulSoup(open("page.html"))
getbody = source.find('body')
for child in getbody[0].children:
    try:
        if child.get('style') is not None and child.get('style') == "display:none":
            # it in here
            child.decompose()
    except:
        continue
print source
# display:hidden div's are still there.
```

Answer 1: The following code does what you want and works
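The answer itself is truncated in this excerpt. Not the answer's actual code, but a minimal sketch of one common way to drop every element whose style attribute hides it, matching on the style attribute instead of iterating over the direct children of body:

```
from bs4 import BeautifulSoup

source = BeautifulSoup(open("page.html"), "html.parser")

# find_all accepts a function: match any tag whose style contains display:none
hidden = source.find_all(lambda tag: tag.get("style") and "display:none" in tag["style"])
for tag in hidden:
    tag.decompose()

print(source)
```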

Using beautifulsoup to parse string efficiently

心不动则不痛 Submitted on 2021-01-04 07:27:31

Question: I am trying to parse this HTML to get the item title (e.g. Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW)

```
<div style="" class="">
  <h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  </span>Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW</h1>
  <h2 id="subTitle" class="it-sttl"> Brand New + Free Shipping, Satisfaction Guaranteed! </h2>
  <!-- DO NOT change linkToTagId="rwid" as the catalog response
```
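The snippet above is enough to show one straightforward way to pull out just the title text: locate the h1 by its id, drop the "Details about" span, and read what is left. A minimal sketch based only on the markup shown in the question; the variable names and the input filename are mine.

```
from bs4 import BeautifulSoup

html = open("item.html").read()          # hypothetical file holding the snippet
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1", id="itemTitle")
h1.find("span", class_="g-hdn").extract()   # remove the "Details about" prefix
print(h1.get_text(strip=True))
# Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW
```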