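What Is a Web Crawler

A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules.

For example, when a user browses Douban, what they see as a human being is HTML rendered with styles.

A crawler sees only the HTML source text.

All a crawler does is extract the content inside those tags. The task itself is simple, but implementing it takes quite a few steps.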
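As a rough illustration of "extracting the content inside the tags", the sketch below uses only Python's standard-library html.parser. The sample markup and the names TextExtractor and extract_text are made up for this example and are not real Douban markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found between tags, ignoring the markup itself."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags; skip pure whitespace.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page fragment used only for illustration.
sample_html = "<div><h1>Movie Title</h1><p class='score'>9.7</p></div>"
print(extract_text(sample_html))  # ['Movie Title', '9.7']
```

A human sees a styled heading and a score; the crawler sees the string in sample_html and keeps only the two pieces of text.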

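What a URL Represents

URL stands for Uniform Resource Locator. Behind a URL there are essentially two possibilities: (1) a static page, or (2) a program that handles the request (such as a servlet). So when a crawler visits a URL, what it has to parse may be an HTML page or JSON data (or some other structured format similar to JSON).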
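One way a crawler might tell the two cases apart is to look at the Content-Type header of the response. The sketch below assumes the server sets that header correctly; parse_body is a hypothetical helper name, and the sample bodies are invented.

```python
import json


def parse_body(content_type, body):
    """Return parsed JSON for JSON responses, otherwise the raw text."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json" or media_type.endswith("+json"):
        return json.loads(body)
    # Anything else (for example text/html) is returned unchanged
    # so that an HTML parser can handle it later.
    return body


page = parse_body("text/html; charset=utf-8", "<html><body>Hello</body></html>")
data = parse_body("application/json", '{"title": "Example", "score": 9.5}')
print(type(page).__name__, type(data).__name__)  # str dict
```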

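Sending HTTP Requests from Code

In everyday use, the browser is our tool for accessing the internet. It makes HTTP request after HTTP request on our behalf, and the server, on receiving each request, returns the response content to the browser. An HTML page, for example, is parsed and rendered by the browser once it arrives, which yields the page we normally see.

Now imagine a browser with no interface, one that does not render the response into a nice-looking page but simply outputs the raw HTML. That is what sending a request from code and obtaining the response amounts to.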
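A minimal sketch of such an "interface-less browser", using only the standard library's urllib. The function names build_request and fetch_html and the User-Agent string are assumptions made for this example.

```python
from urllib.request import Request, urlopen


def build_request(url):
    """Create a GET request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; demo-crawler)"})


def fetch_html(url, timeout=10):
    """Send the request and return the response body decoded as text."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Use the charset announced by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (requires network access, so it is left commented out):
# print(fetch_html("http://www.baidu.com")[:200])
```

Nothing is rendered here: the return value is just the HTML source as one string.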

Processing the Returned HTML

// Pseudocode: HttpRequest and HttpResponse stand in for whatever HTTP client is used.
public void parseDemo() {
	HttpResponse response = HttpRequest.request("http://www.baidu.com");
}

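The code above simulates obtaining a response. The returned HTML is held in response; suppose response has a member variable String htmlContent that stores it. Keep in mind that this is just one very long string: it has to be parsed explicitly to get the needed data out of it. This is usually done with a parsing library, which normalizes the text into a well-defined structure so that its content can be extracted conveniently.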
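To make the "long string in, structured data out" step concrete, here is a small sketch that pulls every link and its text out of an HTML string with the standard-library html.parser. The html_content value plays the role of the hypothetical htmlContent field; the markup and the class name LinkCollector are invented for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical response body standing in for htmlContent.
html_content = """
<ul>
  <li><a href="/movie/1">First Movie</a></li>
  <li><a href="/movie/2">Second <b>Movie</b></a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(html_content)
print(collector.links)
# [('/movie/1', 'First Movie'), ('/movie/2', 'Second Movie')]
```

Libraries such as Beautiful Soup wrap this kind of event-driven parsing behind a friendlier search interface, which is why the full example further down uses one.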

Handling the Results

This part is simple: the results can be stored in a database or written out to a file.
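Both options can be sketched with the standard library alone. The table layout, the field names (ranking, title, score) and the sample items below are assumptions made for this example, not something the crawler further down produces.

```python
import json
import sqlite3


def save_to_database(conn, items):
    """Insert each item as one row of a `movies` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movies (ranking INTEGER, title TEXT, score REAL)"
    )
    conn.executemany(
        "INSERT INTO movies (ranking, title, score) VALUES (:ranking, :title, :score)",
        items,
    )
    conn.commit()


def save_to_file(path, items):
    """Write one JSON object per line so the file is easy to read back."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


# Invented sample data.
items = [
    {"ranking": 1, "title": "First Movie", "score": 9.5},
    {"ranking": 2, "title": "Second Movie", "score": 9.0},
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
save_to_database(conn, items)
rows = conn.execute(
    "SELECT ranking, title, score FROM movies ORDER BY ranking"
).fetchall()
print(rows)  # [(1, 'First Movie', 9.5), (2, 'Second Movie', 9.0)]

# save_to_file("movies.jsonl", items)  # file variant; the path is an example
```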

Code Example

import requests
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
import time


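def get_one_page(url):
    """Fetch one page and return its HTML, or None if the request fails."""
    # A browser-like User-Agent makes the request look like ordinary browser traffic.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    try:
        # The timeout keeps the crawler from hanging on an unresponsive server.
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None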


def get_html_target(html):
    soup = BeautifulSoup(html,'lxml')
    target_div = soup.find_all('dd')
    for single_target in target_div:
        movie_index = single_target.find('i').get_text().strip()
        movie_img = single_target.find('img', class_='board-img')['data-src']
        # A tag attribute can be read with square brackets or with get(),
        # e.g. single_target.find('a').get('title')
        movie_name = single_target.find('p', class_='name').a['title']
        movie_actors = single_target.find('p', attrs={'class': 'star'}).get_text().strip()
        movie_time = single_target.find('p', attrs={'class': 'releasetime'}).get_text().strip()
        score_integer = single_target.find(class_='integer').get_text().strip()
        score_fraction = single_target.find(class_='fraction').get_text().strip()
        full_score = score_integer + score_fraction
        yield {
            '排名': movie_index,       # rank
            '图片地址': movie_img,     # image URL
            '电影名': movie_name,      # movie title
            '主演': movie_actors,      # starring
            '上映时间': movie_time,    # release date
            '电影评分': full_score     # score
        }


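def write_to_file(content):
    # Open the output file ("Maoyan Top100") once in append mode;
    # the with block closes it automatically, so no explicit close() is needed.
    with open('猫眼Top100', 'a', encoding='utf-8') as f:
        for item in content:
            f.write(str(item) + '\n')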



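def main(offset):
    url = 'http://www.maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:
        # The request failed or returned a non-200 status; skip this page
        # instead of passing None to BeautifulSoup.
        print('failed to fetch', url)
        return
    content = get_html_target(html)
    write_to_file(content)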


if __name__ == "__main__":
    for i in range(10):
        main(i*10)
        time.sleep(1)
    print("program complete")

Source: oschina

Link: https://my.oschina.net/u/4536753/blog/4355133
