regex pattern in python for parsing HTML title tags

后端未结

关注

 4  1470

野性不改 2020-12-05 20:09

I am learning to use both the re module and the urllib module in python and attempting to write a simple web scraper. Here\'s the code I\'ve writte

4条回答

伪装坚强ぢ (楼主)

2020-12-05 20:26
You are using a regular expression, and matching HTML with such expressions get too complicated, too fast.

Use a HTML parser instead, Python has several to choose from. I recommend you use BeautifulSoup, a popular 3rd party library.

BeautifulSoup example:
```
from bs4 import BeautifulSoup

response = urllib2.urlopen(url)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text
```
Since a title tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags, you will run into hugely complex issues.

Your specific problem can be solved by matching additional characters within the title tag, optionally:
```
r']*>([^<]+)'
```
This matches 0 or more characters that are not the closing > bracket. The '0 or more' here lets you match both extra attributes and the plain </code> tag.</p> </p> <div class="appendcontent"> </div> </div> <div class="jieda-reply"> <span class="jieda-zan button_agree" type="zan" data-id='867155'> <i class="iconfont icon-zan"></i> <em>0</em> </span> <span type="reply" class="showpinglun" data-id="867155"> <i class="iconfont icon-svgmoban53"></i> 讨论(0) </span> <div class="jieda-admin"> </div> <div class="noreplaytext bb"> <center><div> <a href="https://www.e-learn.cn/qa/q-254235.html"> 查看其它4个回答 </a> </div></center> </div> </div> <div class="comments-mod " style="display: none; float:none;padding-top:10px;" id="comment_867155"> <div class="areabox clearfix"> <form class="layui-form" action=""> <div class="layui-form-item"> <label class="layui-form-label" style="padding-left:0px;width:60px;">发布评论:</label> <div class="layui-input-block" style="margin-left:90px;"> <input type="text" placeholder="不少于5个字" AUTOCOMPLETE="off" class="comment-input layui-input" name="content" /> <input type='hidden' value='0' name='replyauthor' /> </div> <div class="mar-t10"><span class="fr layui-btn layui-btn-sm addhuidapinglun" data-id="867155">提交评论 </span></div> </div> </form> </div> <hr> <ul class="my-comments-list nav"> <li class="loading"> <img src='https://www.e-learn.cn/qa/static/css/default/loading.gif' align='absmiddle' /> 加载中... </li> </ul> </div> </li> </ul> <div class="layui-form layui-form-pane"> <form id="huidaform" name="answerForm" method="post"> <div class="layui-form-item layui-form-text"> <a name="comment"></a> <div class="layui-input-block"> <script type="text/javascript" src="https://www.e-learn.cn/qa/static/js/neweditor/ueditor.config.js"></script> <script type="text/javascript" src="https://www.e-learn.cn/qa/static/js/neweditor/ueditor.all.js"></script> <script type="text/plain" id="editor" name="content" style="width:100%;height:200px;"></script> <script type="text/javascript"> var isueditor=1; var editor = UE.getEditor('editor',{ //这里可以选择自己需要的工具按钮名称,此处仅选择如下五个 toolbars:[['source','fullscreen', '|', 'undo', 'redo', '|', 'bold', 'italic', 'underline', 'fontborder', 'strikethrough', 'removeformat', 'formatmatch', 'autotypeset', 'blockquote', 'pasteplain', '|', 'forecolor', 'backcolor', 'insertorderedlist', 'insertunorderedlist', 'selectall', 'cleardoc', '|', 'rowspacingtop', 'rowspacingbottom', 'lineheight', '|', 'customstyle', 'paragraph', 'fontfamily', 'fontsize', '|', 'indent', '|', 'justifyleft', 'justifycenter', 'justifyright', 'justifyjustify', '|', 'link', 'unlink', 'anchor', '|', 'simpleupload', 'insertimage', 'scrawl', 'insertvideo', 'attachment', 'map', 'insertcode', '|', 'horizontal', '|', 'preview', 'searchreplace', 'drafts']], initialContent:'', //关闭字数统计 wordCount:false, zIndex:2, //关闭elementPath elementPathEnabled:false, //默认的编辑区域高度 initialFrameHeight:250 //更多其他参数，请参考ueditor.config.js中的配置项 //更多其他参数，请参考ueditor.config.js中的配置项 }); editor.ready(function() { editor.setDisabled(); }); $("#editor").find("*").css("max-width","362px"); </script> </div> </div> <div class="layui-form-item"> <label for="L_vercode" class="layui-form-label">验证码</label> <div class="layui-input-inline"> <input type="text" id="code" name="code" value="" required lay-verify="required" placeholder="图片验证码" autocomplete="off" class="layui-input"> </div> <div class="layui-form-mid"> <span style="color: #c00;"><img class="hand" src="https://www.e-learn.cn/qa/user/code.html" onclick="javascript:updatecode();" id="verifycode"><a class="changecode" href="javascript:updatecode();"> 看不清?</a></span> </div> </div> <div class="layui-form-item"> <input type="hidden" value="254235" id="ans_qid" name="qid"> <input type="hidden" id="tokenkey" name="tokenkey" value=''/> <input type="hidden" value="regex pattern in python for parsing HTML title tags" id="ans_title" name="title"> <div class="layui-btn layui-btn-disabled" id="ajaxsubmitasnwer" >提交回复</div> </div> </form> </div> </div> <input type="hidden" value="254235" id="adopt_qid" name="qid" /> <input type="hidden" id="adopt_answer" value="0" name="aid" /> </div> <div class="layui-col-md4">  <dl class="fly-panel fly-list-one"> <dt class="fly-panel-title">热议问题</dt>