Scraping information on 500 high-rated Douban movies

Submitted by 試著忘記壹切 on 2020-02-08 23:48:31

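Step 1: Define the requirements

1. Analyze the pattern behind the data source

2. Get the URL of each high-rated movie's detail information

3. Use the detail URLs to fetch all the information

4. Join the two tables from steps 2 and 3 into one table and save it to Excel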

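Step 2: Analyze where the data is stored

Where the Douban high-rated movies are stored:

Source page URL:

url = 'https://movie.douban.com/explore#!type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend&page_limit=20&page_start=0'

Starting from this page, the URL that actually loads the data can be found:

url = 'https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend&page_limit=20&page_start=0'

Changing page_limit=xxxx returns more entries. Once page_limit reaches 500, the number of movies stops increasing.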
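As a small sketch, the data-loading URL can be assembled from its query parameters, which makes it easy to experiment with page_limit. The helper name build_search_url and its default values are hypothetical and chosen only for illustration; the tag value 豆瓣高分 is the decoded form of the percent-encoded string in the URL above.

```python
from urllib.parse import urlencode

# Endpoint taken from the data-loading URL above
BASE_URL = 'https://movie.douban.com/j/search_subjects'


def build_search_url(page_limit=20, page_start=0, tag='豆瓣高分'):
    """Build the data-loading URL (hypothetical helper, not from the original post)."""
    params = {
        'type': 'movie',
        'tag': tag,  # urlencode percent-encodes this as UTF-8
        'sort': 'recommend',
        'page_limit': page_limit,
        'page_start': page_start,
    }
    return BASE_URL + '?' + urlencode(params)


# Try a larger page size
print(build_search_url(page_limit=500))
```

Building the query string this way avoids hand-editing the long percent-encoded tag each time the page size changes.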

So a single request to this URL is enough to get the title and detail link of every high-rated movie:

import requests
import pandas as pd

# Data-loading URL; page_limit=1000 is above the 500 at which results stop growing
url = 'https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend&page_limit=1000&page_start=0'
# Set the request headers (a browser User-Agent)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
r = requests.get(url, headers=headers, timeout=30).json()
# Keep only the fields we need from each entry in r['subjects']
columns = ['title', 'rate', 'id', 'url']
movie_info = pd.DataFrame(r['subjects'], columns=columns)
movie_info.head(2)

The response is stored as JSON, so it is parsed first and the required fields are then extracted.
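To make the parsing step concrete, here is a minimal offline sketch using only the standard library. The sample payload is invented: it only mimics the shape the code relies on (a top-level "subjects" list with one dict per movie), every value is a placeholder, and the "cover" key is an assumed stand-in for fields that are not needed. The helper name pick_fields is hypothetical.

```python
import json

# Invented sample payload; values are placeholders, not real Douban data
sample = '''
{"subjects": [
  {"title": "Example Movie A", "rate": "9.0", "id": "0000001",
   "url": "https://example.invalid/0000001", "cover": "placeholder"},
  {"title": "Example Movie B", "rate": "8.5", "id": "0000002",
   "url": "https://example.invalid/0000002", "cover": "placeholder"}
]}
'''


def pick_fields(payload, fields=('title', 'rate', 'id', 'url')):
    """Return one dict per movie containing only the requested fields."""
    return [{key: movie.get(key) for key in fields}
            for movie in payload['subjects']]


rows = pick_fields(json.loads(sample))
for row in rows:
    print(row['id'], row['title'], row['rate'])
```

Passing columns=columns to pd.DataFrame performs the same kind of selection: only the listed keys become columns, and any other keys in each dict are ignored.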

Next, use the movie ids obtained above to build the URL for each movie's detail information:

url_info = 'https://movie.douban.com/j/subject_abstract?subject_id=' + id

Implementation:

m_info = []
for movie_id in movie_info['id']:
    url_info = 'https://movie.douban.com/j/subject_abstract?subject_id=' + movie_id
    r = requests.get(url_info, headers=headers, timeout=30).json()
    subject = r['subject']
    info = {}

    # Keep at most three actors; pad missing slots with '/'
    actors = (subject['actors'] + ['/'] * 3)[:3]
    info['actors1'], info['actors2'], info['actors3'] = actors
    # Keep the first director, or '/' if none is listed
    info['directors'] = subject['directors'][0] if subject['directors'] else '/'
    info['duration'] = subject['duration']
    info['rate'] = subject['rate']
    # Keep at most three genres; pad missing slots with '/'
    types = (subject['types'] + ['/'] * 3)[:3]
    info['types1'], info['types2'], info['types3'] = types
    info['region'] = subject['region']
    info['release_year'] = subject['release_year']
    m_info.append(info)
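Entries with fewer than three actors or genres need their missing slots filled with '/'. A minimal sketch of that padding as a reusable helper follows; first_n is a hypothetical name and is not part of the original code.

```python
def first_n(values, n=3, filler='/'):
    """Return the first n items of values, padded with filler if there are fewer."""
    head = list(values or [])[:n]
    return head + [filler] * (n - len(head))


print(first_n(['A', 'B']))            # ['A', 'B', '/']
print(first_n([]))                    # ['/', '/', '/']
print(first_n(['A', 'B', 'C', 'D']))  # ['A', 'B', 'C']
```

Padding the list once, instead of catching an IndexError per slot, means a movie with exactly two actors keeps its second actor rather than having it replaced by '/'.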

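Use pandas to turn the detail information into a table:

df_info = pd.DataFrame(m_info)
# Drop the duplicate field (movie_info already has a 'rate' column)
del df_info['rate']
# Both tables share the default 0..n-1 index, so join aligns them row by row
movie_data = movie_info.join(df_info)

# Write to Excel (to_excel needs an Excel writer such as openpyxl installed)
movie_data.to_excel('豆瓣高分电影500部.xlsx', index=False)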
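The reason the duplicate 'rate' column has to be dropped before joining can be seen on two toy tables. This is only a sketch: the frames below are tiny stand-ins with placeholder values, not the real scraped data.

```python
import pandas as pd

# Toy stand-ins for movie_info and df_info; all values are placeholders
movie_info = pd.DataFrame({'title': ['A', 'B'], 'rate': ['9.0', '8.5'], 'id': ['1', '2']})
df_info = pd.DataFrame({'rate': ['9.0', '8.5'], 'region': ['X', 'Y']})

# Joining while both tables still contain 'rate' raises ValueError
try:
    movie_info.join(df_info)
    overlap_error = False
except ValueError:
    overlap_error = True

# After dropping the duplicate column, join aligns rows on the shared index
movie_data = movie_info.join(df_info.drop(columns='rate'))

print(overlap_error)
print(movie_data.columns.tolist())
```

DataFrame.join matches rows by index, so this works here only because both tables were built in the same order and share the default integer index.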

 
