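Step 1: Define the requirements
1. Analyze how the data source is organized
2. Get the links to the detailed information for each Douban high-rated movie
3. Use each detail URL to fetch the full information
4. Join the two tables from steps 2 and 3 into a single table and save it to Excel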
Step 2: Analyze where the data is stored
Where the Douban high-rated movies are stored:
Original page link:
url = 'https://movie.douban.com/explore#!type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend&page_limit=20&page_start=0'
From this page, the data-loading link can be found:
url = 'https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend&page_limit=20&page_start=0'
Changing page_limit=xxxx returns more entries; once page_limit reaches 500, the number of movies stops increasing.
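The query string of the data-loading URL can also be assembled from named parameters, which makes it easier to vary page_limit and page_start. The following is a minimal sketch: the helper name build_search_url is an assumption introduced here, not part of the original code, and the tag value is simply the URL-decoded form of the tag= parameter in the link above.

```python
from urllib.parse import urlencode, unquote

BASE = 'https://movie.douban.com/j/search_subjects'

def build_search_url(tag, page_limit=20, page_start=0):
    """Hypothetical helper: rebuild the data-loading URL from its parameters."""
    params = {
        'type': 'movie',
        'tag': tag,  # urlencode percent-encodes the UTF-8 bytes of the tag
        'sort': 'recommend',
        'page_limit': page_limit,
        'page_start': page_start,
    }
    return BASE + '?' + urlencode(params)

# Decode the tag from the original link, then rebuild the same URL from parts
tag = unquote('%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86')
url = build_search_url(tag, page_limit=20, page_start=0)
print(url)
```

Calling build_search_url(tag, page_limit=1000) gives the same link with only the page_limit value changed.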
The data-loading URL can therefore be used to fetch the title and link of every high-rated movie:
import requests
import pandas as pd

# Data-loading URL (page_limit is set above 500 so that every movie is returned)
url = 'https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend&page_limit=1000&page_start=0'
# Set the request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
# The response body is JSON; the movie records sit under the 'subjects' key
r = requests.get(url, headers=headers, timeout=30).json()
columns = ['title', 'rate', 'id', 'url']
movie_info = pd.DataFrame(r['subjects'], columns=columns)
movie_info.head(2)
The data is stored as JSON, so the response is parsed to pull out the fields that are needed.
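To make the JSON structure concrete, the sketch below parses a hand-written payload shaped like the response (a top-level subjects list of records) and keeps only the four fields used above. The sample values, including the ids and URLs, are made up for illustration; only the key names subjects, title, rate, id, and url come from the code above.

```python
import json

# Made-up payload in the same shape as the response; values are illustrative only
sample = '''
{"subjects": [
  {"title": "Example Movie A", "rate": "9.1", "id": "0000001",
   "url": "https://movie.douban.com/subject/0000001/", "extra": "ignored"},
  {"title": "Example Movie B", "rate": "8.7", "id": "0000002",
   "url": "https://movie.douban.com/subject/0000002/"}
]}
'''

columns = ['title', 'rate', 'id', 'url']
payload = json.loads(sample)

# Keep only the wanted keys from each record, in a fixed column order
rows = [{col: subject.get(col) for col in columns} for subject in payload['subjects']]

for row in rows:
    print(row)
```

Passing the subjects list to pd.DataFrame with columns=columns performs the same selection and yields one table column per key.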
Next, use the movie ids obtained above to build the link to each movie's detailed information:
url_info = 'https://movie.douban.com/j/subject_abstract?subject_id=' + id
Implementation:
m_info = []
for movie_id in movie_info['id']:
    url_info = 'https://movie.douban.com/j/subject_abstract?subject_id=' + str(movie_id)
    r = requests.get(url_info, headers=headers, timeout=30).json()
    subject = r['subject']
    info = {}

    # Keep up to three actors; pad the missing ones with '/'
    actors = (list(subject['actors']) + ['/'] * 3)[:3]
    info['actors1'], info['actors2'], info['actors3'] = actors
    # Only the first director is kept; '/' if none is listed
    info['directors'] = subject['directors'][0] if subject['directors'] else '/'
    info['duration'] = subject['duration']
    info['rate'] = subject['rate']
    # Keep up to three genres; pad the missing ones with '/'
    types = (list(subject['types']) + ['/'] * 3)[:3]
    info['types1'], info['types2'], info['types3'] = types
    info['region'] = subject['region']
    info['release_year'] = subject['release_year']
    m_info.append(info)
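The field extraction in the loop can be checked without any network access by moving it into a function and feeding it a hand-written record. This is only a sketch under assumptions: flatten_subject and pad are hypothetical names introduced here, and the sample record is invented, using only the keys that the loop above reads.

```python
def pad(values, n, fill='/'):
    """Return the first n items of values, padded with fill if there are fewer."""
    values = list(values or [])
    return (values + [fill] * n)[:n]

def flatten_subject(subject):
    """Hypothetical helper: turn one 'subject' record into a flat row."""
    info = {}
    info['actors1'], info['actors2'], info['actors3'] = pad(subject.get('actors'), 3)
    info['directors'] = pad(subject.get('directors'), 1)[0]
    info['duration'] = subject.get('duration')
    info['rate'] = subject.get('rate')
    info['types1'], info['types2'], info['types3'] = pad(subject.get('types'), 3)
    info['region'] = subject.get('region')
    info['release_year'] = subject.get('release_year')
    return info

# Invented record with only two actors and one genre
sample = {
    'actors': ['Actor One', 'Actor Two'],
    'directors': ['Director One'],
    'duration': '120 min',
    'rate': '9.0',
    'types': ['Drama'],
    'region': 'Example Region',
    'release_year': '2000',
}
row = flatten_subject(sample)
print(row)
```

With two actors in the record, actors2 keeps the second name and only actors3 falls back to '/'.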
Use pandas to convert the detailed information into a table:
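df_info = pd.DataFrame(m_info)
# Drop the duplicated field (movie_info already has 'rate')
del df_info['rate']
# join() pairs rows by index; both tables were built in the same order
movie_data = movie_info.join(df_info)

# Write to Excel
movie_data.to_excel('豆瓣高分电影500部.xlsx', index=False)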
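DataFrame.join matches rows by index, so the join step relies on both tables being in the same order with the default 0..n-1 index. The small sketch below uses invented rows to show the alignment and the removal of the duplicated rate column; the data is illustrative only.

```python
import pandas as pd

# Invented rows standing in for movie_info and the detail table
movie_info = pd.DataFrame(
    [{'title': 'Example Movie A', 'rate': '9.1', 'id': '0000001'},
     {'title': 'Example Movie B', 'rate': '8.7', 'id': '0000002'}]
)
df_info = pd.DataFrame(
    [{'rate': '9.1', 'region': 'Region A', 'release_year': '2000'},
     {'rate': '8.7', 'region': 'Region B', 'release_year': '2010'}]
)

# Drop the column both tables share; join() raises on overlapping
# column names unless a suffix is given
df_info = df_info.drop(columns=['rate'])

# Rows are paired by index: row 0 with row 0, row 1 with row 1
movie_data = movie_info.join(df_info)
print(movie_data)
```

If a detail request were skipped inside the loop, the two tables would fall out of step, so merging on an explicit id column would be a safer alternative in that case.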
Source: https://www.cnblogs.com/syd123/p/12271509.html