25-2 Regex crawler example

1. Import the libraries

import re
from urllib.request import urlopen    # standard-library module, used to fetch a page's source code

urlopen fetches the page. read() returns the body as bytes, and decode('utf-8') turns it into the source string.

res = urlopen('https://www.cnblogs.com/zhuangdd/p/12644081.html')
print(res.read().decode('utf-8'))

Output (truncated):
<!DOCTYPE html>
<html lang="zh-cn">
<head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta name="referrer" content="origin" />
    <meta property="og:description" content="帮助学习的工具 http://tool.chinaz.com/regex/ 字符组 []在一个字符的位置上能出现的内容[1bc] 是一个范围[0-9][A-Z][a-z] 匹配三个字符[abc0-9]" />
    <meta http-equiv="Cache-Control" content="no-transform" />
    <meta http-equiv="Cache-Control" content="no-siteapp" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <title>25 -1  正则    re模块 （find
... (remaining output omitted)
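
The fetch step above can be sketched a little more defensively. This is a minimal sketch, not the author's code: fetch_text is a hypothetical helper name, and a data: URL stands in for a real page so that the example runs without network access. It closes the response with a with-block and takes the charset from the Content-Type header when one is present, falling back to UTF-8 otherwise.

```python
from urllib.request import urlopen


def fetch_text(url, default_charset="utf-8"):
    """Fetch a URL and decode the body to str (hypothetical helper)."""
    # The with-block closes the response even if read() raises.
    with urlopen(url) as response:
        raw = response.read()
        # Prefer the charset announced in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset)


# Assumption: a data: URL is used in place of a real page so this runs offline.
sample_url = "data:text/html;charset=utf-8,%3Ctitle%3Ehello%3C%2Ftitle%3E"
page = fetch_text(sample_url)
print(page)
```

For a real page, pass its http(s) URL to fetch_text in the same way.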

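flags accepts many optional values:

re.I (IGNORECASE): ignore case; the name in parentheses is the full spelling
re.M (MULTILINE): multi-line mode; changes the behavior of ^ and $ so they also match at the start and end of each line
re.S (DOTALL): the dot matches any character, including newlines
re.L (LOCALE): locale-aware matching; the special classes \w, \W, \b, \B, \s, \S depend on the current locale. Not recommended (in Python 3 it only works with bytes patterns)
re.U (UNICODE): \w \W \s \S \d \D follow the character properties defined by Unicode. This is the default for str patterns in Python 3
re.X (VERBOSE): verbose mode; the pattern string can span multiple lines, whitespace is ignored, and comments can be added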

flags
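
To make the flags above concrete, here is a small sketch of the most commonly used ones. The sample strings and variable names are made up for illustration; only the re functions and flags themselves come from the standard library.

```python
import re

text = "Hello\nworld"

# re.I: case-insensitive matching.
ignore_case = re.findall(r"hello", text, re.I)

# re.M: ^ and $ also match at the start and end of each line.
line_starts = re.findall(r"^\w+", text, re.M)

# Without re.S the dot stops at a newline; with re.S it crosses lines.
dot_default = re.search(r"Hello.world", text)
dot_all = re.search(r"Hello.world", text, re.S)

# re.X: whitespace in the pattern is ignored and comments are allowed.
date_pattern = re.compile(r"""
    (?P<year>\d{4})  -   # four-digit year, then a literal hyphen
    (?P<month>\d{2})     # two-digit month
""", re.X)
date = date_pattern.search("released 2020-04").groupdict()

# Flags are combined with the | operator.
combined = re.findall(r"^world$", "Hello\nWORLD", re.I | re.M)

print(ignore_case, line_starts, dot_default, bool(dot_all), date, combined)
```

The crawler pattern further down relies on re.S, because each movie entry in the page source spans many lines.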

def getPage(url):
    response = urlopen(url)
    return response.read().decode('utf-8')

def parsePage(s):   # s: the page source
    # finditer yields one match object per movie entry
    for match in com.finditer(s):
        yield {
            "id": match.group("id"),
            "title": match.group("title"),
            "rating_num": match.group("rating_num"),
            "comment_num": match.group("comment_num")
        }

def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num  # num is the start offset: 0, 25, 50, ...
    response_html = getPage(url)   # response_html is the page source (str)
    ret = parsePage(response_html) # a generator; nothing is parsed until it is iterated
    # append mode, so every page's results end up in the same file
    with open("move_info7", "a", encoding="utf8") as f:
        for obj in ret:
            print(obj)
            f.write(str(obj) + "\n")

# Raw strings keep \d from being read as a string escape.
com = re.compile(
        r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
        r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)
# Ten pages of 25 entries each: start = 0, 25, ..., 225
for start in range(0, 250, 25):
    main(start)

Douban Top 250 code
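
The pattern in the code above can be tried without touching the network. In the sketch below, sample_html is a hand-written stand-in that only mimics the structure the pattern expects; it is not copied from the real Douban page, and the titles and numbers are invented. The pattern is the same one compiled above, written with raw strings.

```python
import re

com = re.compile(
    r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)

# Hypothetical markup: two entries shaped like the blocks the pattern targets.
sample_html = '''
<div class="item">
  <div class="pic">
    <em class="">1</em>
  </div>
  <div class="info">
    <span class="title">Movie A</span>
    <span class="rating_num" property="v:average">9.7</span>
    <span>12345人评价</span>
  </div>
</div>
<div class="item">
  <div class="pic">
    <em class="">2</em>
  </div>
  <div class="info">
    <span class="title">Movie B</span>
    <span class="rating_num" property="v:average">9.6</span>
    <span>6789人评价</span>
  </div>
</div>
'''

# groupdict() returns every named group of one match as a dict.
movies = [match.groupdict() for match in com.finditer(sample_html)]
for movie in movies:
    print(movie)
```

Each .*? is non-greedy, so a match stops at the first closing tag that fits instead of running on into the next entry. Note that comment_num keeps the trailing 人, because the pattern only strips 评价.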

Source: https://www.cnblogs.com/zhuangdd/p/12644200.html

Tags

regex

Python crawler