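1. Import the libraries
import re
from urllib.request import urlopen  # standard-library function for fetching a page's source

urlopen returns a response object. read() gives the page source as bytes, and decode('utf-8') turns those bytes into a string.
res = urlopen('https://www.cnblogs.com/zhuangdd/p/12644081.html')
print(res.read().decode('utf-8'))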
Output (truncated):
<!DOCTYPE html>
<html lang="zh-cn">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="referrer" content="origin" />
<meta property="og:description" content="帮助学习的工具 http://tool.chinaz.com/regex/ 字符组 []在一个字符的位置上能出现的内容[1bc] 是一个范围[0-9][A-Z][a-z] 匹配三个字符[abc0-9]" />
<meta http-equiv="Cache-Control" content="no-transform" />
<meta http-equiv="Cache-Control" content="no-siteapp" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<title>25 -1 正则 re模块 (find
... (remaining output omitted)
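The fetch step above can be sketched a little more defensively. This is a minimal sketch, not the author's code: the helper names build_request and fetch_text, the User-Agent string, and the 10-second timeout are all assumptions made for illustration. It sends an explicit User-Agent (some sites reject the default urllib one) and decodes with the charset the server declares, falling back to UTF-8.

```python
from urllib.request import Request, urlopen

# Hypothetical User-Agent string, chosen only for this example.
DEFAULT_UA = "Mozilla/5.0 (compatible; example-script)"


def build_request(url, user_agent=DEFAULT_UA):
    """Build a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_text(url, timeout=10):
    """Fetch a page and return its source as a str."""
    with urlopen(build_request(url), timeout=timeout) as res:
        # Use the charset from the Content-Type header if present, else UTF-8.
        charset = res.headers.get_content_charset() or "utf-8"
        return res.read().decode(charset)
```

Calling fetch_text('https://www.cnblogs.com/zhuangdd/p/12644081.html') would then return the same kind of string that the print call above displays.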

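flags accepts several optional values (the full name is given in parentheses):
re.I (IGNORECASE): ignore case.
re.M (MULTILINE): multi-line mode; ^ and $ also match at the start and end of each line.
re.S (DOTALL): the dot matches any character, including a newline.
re.L (LOCALE): locale-aware matching; \w, \W, \b, \B depend on the current locale. In Python 3 it works only with bytes patterns. Not recommended.
re.U (UNICODE): \w, \W, \s, \S, \d, \D follow the Unicode character properties. This is the default for str patterns in Python 3.
re.X (VERBOSE): verbose mode; the pattern string can span multiple lines, whitespace is ignored, and comments can be added.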
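A short sketch of how the most common flags change matching. The sample strings and variable names are made up for this example.

```python
import re

text = "Hello\nworld"

# re.I: the lowercase pattern still matches "Hello".
ignore_case = re.findall(r"hello", text, re.I)

# re.M: ^ and $ match at every line boundary, not only at the string ends.
without_m = re.findall(r"^\w+$", text)
with_m = re.findall(r"^\w+$", text, re.M)

# re.S: without it the dot cannot cross the newline between the two words.
without_s = re.search(r"Hello.world", text)
with_s = re.search(r"Hello.world", text, re.S)

# re.X: whitespace in the pattern is ignored and # starts a comment.
date_re = re.compile(r"""
    (?P<year>\d{4})   # four-digit year
    -
    (?P<month>\d{2})  # two-digit month
""", re.X)
date_match = date_re.search("posted 2020-04")

print(ignore_case)                 # ['Hello']
print(without_m, with_m)           # [] ['Hello', 'world']
print(without_s is None)           # True
print(with_s is not None)          # True
print(date_match.group("year"), date_match.group("month"))  # 2020 04
```

Flags can be combined with |, for example re.S | re.I.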

def getPage(url):
    # Download the page and return its source as a str
    response = urlopen(url)
    return response.read().decode('utf-8')


def parsePage(s):  # s: page source
    # Yield one dict per movie matched by the compiled pattern
    for m in com.finditer(s):
        yield {
            "id": m.group("id"),
            "title": m.group("title"),
            "rating_num": m.group("rating_num"),
            "comment_num": m.group("comment_num"),
        }


def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num  # num = 0, 25, 50, ...
    response_html = getPage(url)  # source of this page, as a str
    ret = parsePage(response_html)  # generator
    print(ret)
    # Append one record per line; the with block closes the file even on error
    with open("move_info7", "a", encoding="utf8") as f:
        for obj in ret:
            print(obj)
            f.write(str(obj) + "\n")


# Raw strings keep \d intact; re.S lets .*? span the newlines between tags
com = re.compile(
    r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)

# 10 pages of 25 movies each: start = 0, 25, ..., 225
for start in range(0, 250, 25):
    main(start)
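To see what the named groups capture without hitting the network, the same pattern can be run against a small hand-written snippet. This is only a sketch: SAMPLE_HTML is invented to mimic the tag structure the pattern expects and is not real Douban markup, and the names ITEM_RE and parse_items are hypothetical.

```python
import re

# Same pattern as above, split across lines for readability.
ITEM_RE = re.compile(
    r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+)'
    r'.*?<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>'
    r'.*?<span>(?P<comment_num>.*?)评价</span>',
    re.S,
)

# Invented sample data, shaped like the markup the pattern targets.
SAMPLE_HTML = """
<div class="item">
  <div class="pic"><em class="">1</em></div>
  <span class="title">Sample Movie A</span>
  <span class="rating_num" property="v:average">9.7</span>
  <span>1000人评价</span>
</div>
<div class="item">
  <div class="pic"><em class="">2</em></div>
  <span class="title">Sample Movie B</span>
  <span class="rating_num" property="v:average">9.6</span>
  <span>2000人评价</span>
</div>
"""


def parse_items(html):
    """Yield one dict of named groups per matched item."""
    for m in ITEM_RE.finditer(html):
        yield m.groupdict()


records = list(parse_items(SAMPLE_HTML))
for record in records:
    print(record)
```

Note that comment_num keeps the trailing 人 (for example '1000人'), because the group ends just before 评价. A pattern such as (?P<comment_num>\d+)人评价 would capture only the digits, assuming the page always uses that wording.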
Source: https://www.cnblogs.com/zhuangdd/p/12644200.html
