8. Regular Expressions and XPath
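The scraper in this chapter relies on the re.S flag (an alias of re.DOTALL) combined with a non-greedy capture group. The following is a minimal sketch of how that flag changes matching. The sample_html string is made up for illustration and only assumes the div class named in the scraper's comments; the real page markup may differ.

```python
import re

# Hypothetical sample markup, not taken from the real site.
sample_html = '''<div class="f18 mb20">
    first joke
</div>
<div class="f18 mb20">
    second joke
</div>'''

pattern = r'<div class="f18 mb20">(.*?)</div>'

# Without re.S, "." does not match a newline, so a body that
# spans several lines cannot be captured.
without_flag = re.findall(pattern, sample_html)

# With re.S, "." also matches newlines, so the text is treated as one unit.
# The non-greedy "?" makes each match stop at the nearest closing </div>.
with_flag = [m.strip() for m in re.findall(pattern, sample_html, re.S)]

print(without_flag)
print(with_flag)
```

In this sketch the first call finds nothing because every body starts with a line break, while the second call returns both bodies.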
1. Scraping Neihan Duanzi jokes with regular expressions

import re

import requests


def loadPage(page):
    url = "http://www.neihan8.com/article/list_5_" + page + ".html"
    # User-Agent header
    user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
    headers = {'User-Agent': user_agent}
    response = requests.get(url, headers=headers)
    # The site serves GBK-encoded pages
    response.encoding = 'gbk'
    html = response.text
    return html


if __name__ == "__main__":
    page = input('Enter the page number to scrape: ')
    html = loadPage(page)
    # with open('a.html', 'w') as f:
    #     f.write(html)
    # Find all joke bodies: <div class="f18 mb20"></div>
    # Without re.S, "." does not match a newline, so a match is only
    # attempted within one line at a time and then restarts on the next line.
    # With re.S, the whole string is matched as a single unit, finding (.*?