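The incident:
Today, while writing vpnbook.py (see vpnbook.py), I ran into a problem: my regular expression matched far too much data, most of which I did not need.
I needed to parse an HTML page, and to keep the script cross-platform and quick to use, I did not use a third-party library (such as BeautifulSoup).
The HTML I fetched is as follows: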

<!DOCTYPE html>
<!--[if lt IE 7 ]><html class="ie ie6 no-js" lang="en"> <![endif]-->
<!--[if IE 7 ]><html class="ie ie7 no-js" lang="en"> <![endif]-->
<!--[if IE 8 ]><html class="ie ie8 no-js" lang="en"> <![endif]-->
<!--[if IE 9 ]><html class="ie ie9 no-js" lang="en"> <![endif]-->
<!--[if (gte IE 9)|!(IE)]><!--><html class="no-js" lang="en"> <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">

<title>Free VPN Accounts • 100% Free PPTP and OpenVPN Service</title>
<meta name="description" content="Free VPN Service – VPNBook.com is the #1 premium Free VPN Server account provider. US, UK, and offshore VPN servers available.">
<meta name="keywords" content="free vpn, free vpn service, free vpn server, free vpn account, openvpn, pptp vpn, web proxy" />
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=0">
<meta name="apple-mobile-web-app-capable" content="yes">

<link rel="stylesheet" href="/css/skeleton-v1.1.css">
<link rel="stylesheet" href="/css/flexslider-v1.8.css">
<link rel="stylesheet" href="/css/main-r6.css?0211">
<link rel="stylesheet" href="/css/media-queries-r6.css">
<link rel="stylesheet" href="/css/sprites-r6.css">
<link rel="stylesheet" href="/css/theme-default-r6.css">
<link href='https://fonts.googleapis.com/css?family=Open+Sans:400,600,400italic' rel='stylesheet' type='text/css'>

<link rel="shortcut icon" href="/images/favicon.ico">
<link rel="apple-touch-icon" href="/images/apple-touch-icon.png">
<link rel="apple-touch-icon" sizes="72x72" href="/images/apple-touch-icon-72x72.png">
<link rel="apple-touch-icon" sizes="114x114" href="/images/apple-touch-icon-114x114.png">

<!-- Allow IE to render HTML5 -->
<!--[if lt IE 9]>
<script src="http://html5shiv.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->

<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-40096058-1', 'vpnbook.com');
ga('send', 'pageview');
</script>

</head>

<body>
<div id="main" role="main">

<header>
<div class="container">
<h1 class="logo one-third column alpha">
<a href="/">
<img src="/images/logo.png" alt="Free VPN" class="scale-with-grid" />
<img src="/images/logo-mobile.png" class="scale-with-grid mobile-only" alt="" /><!-- Alternative image for mobile devices -->
</a>
</h1>

<nav class="menu two-thirds column omega">
<ul>
<li><a href="/" >VPNBook<br /><span>news</span></a></li>
<li><a href="/freevpn" class='active'>Free VPN<br /><span>accounts</span></a></li>
<li><a href="/webproxy" >Free Web<br /><span>proxy</span></a></li>
<li><a href="/howto" >How-To<br /><span>setup</span></a></li>
<li><a href="/features" >Features<br /><span>service</span></a></li>
<li><a href="/contact" >Privacy<br /><span>contact</span></a></li>
</ul>
</nav>
</div><!-- .container -->

<div class="bottom-gradient">
<span class="left"></span>
<span class="center"></span>
<span class="right"></span>
</div>
</header>

<article id="pricing">
<div class="container">

<div class="sixteen columns titleset">
<h2 class="remove-bottom">Free VPN</h2>
<h6 class="subheader">PPTP and OpenVPN Accounts<br> </h6>

<div class="align-center adsense">
<script type="text/javascript"><!--
google_ad_client = /* Vpnbook LB-1 */
"ca-pub-3860002410887566";
google_ad_slot = "5227255820";
google_ad_width = 728;
google_ad_height = 90;
//-->
</script>
<script type="text/javascript" src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script>
</div>
</div>

<div class="one-third column">
<div class="headset price clearfix">
<img src="/images/empty.gif" alt="" class="large-icons icon-font" />
<h4>Free PPTP VPN</h4>
<span><sup>$</sup>0<sub>/mo</sub></span>
</div>
<div class="bottom-gradient add-top add-bottom">
<span class="left"></span>
<span class="center"></span>
<span class="right"></span>
</div>
<p>PPTP (point to point tunneling) is widely used since it is supported across all Microsoft Windows,
Linux, Apple, Mobile and PS3 platforms. It is however easier to block and might not work if your ISP or
government blocks the protocol. In that case you need to use OpenVPN, which is impossible to detect or block.</p>

<ul class="disc">
<li><strong>euro195.vpnbook.com</strong></li>
<li><strong>euro213.vpnbook.com</strong></li>
<li><strong>uk180.vpnbook.com</strong> <span class="red">(UK VPN - optimized for fast web surfing; no p2p downloading)</span></li>
<li><strong>us1.vpnbook.com</strong> <span class="red">(US VPN - optimized for fast web surfing; no p2p downloading)</span></li>
<li>Username: <strong>vpnbook</strong></li>
<li>Password: <strong>bRudre3u</strong></li>
</ul>

<div><strong><span class="green"> More servers coming. Please Donate.</span></strong></div>

</div><!-- .columns -->

<div class="one-third column box light featured">
<div class="headset price clearfix">
<img src="/images/empty.gif" alt="" class="large-icons icon-font" />
<h4>Free OpenVPN <br><small>(Recommended)</small></h4>
<span><sup>$</sup>0<sub>/mo</sub></span>
</div>

<div class="bottom-gradient add-top add-bottom">
<span class="left"></span>
<span class="center"></span>
<span class="right"></span>
</div>

<p>OpenVPN is the best and most recommended open-source VPN software world-wide. It is the most secure VPN option.
You need to download the open-source <a href="/howto">OpenVPN Client</a> and our configuration and certificate bundle
from the links below (use TCP if you cannot connect to UDP due to network restriction).</p>

<ul class="disc">
<li><a href="/free-openvpn-account/VPNBook.com-OpenVPN-Euro1.zip">Euro1 OpenVPN Certificate Bundle</a> </li>
<li><a href="/free-openvpn-account/VPNBook.com-OpenVPN-Euro2.zip">Euro2 OpenVPN Certificate Bundle</a> </li>
<li><a href="/free-openvpn-account/VPNBook.com-OpenVPN-UK1.zip">UK OpenVPN Certificate Bundle</a> <span class="red">(optimized for fast web surfing; no p2p downloading)</span></li>
<li><a href="/free-openvpn-account/VPNBook.com-OpenVPN-US1.zip">US OpenVPN Certificate Bundle</a> <span class="red">(optimized for fast web surfing; no p2p downloading)</span></li>
<li>All bundles include UDP53, UDP 25000, TCP 80, TCP 443 profile</li>
<li>Username: <strong>vpnbook</strong></li>
<li>Password: <strong>bRudre3u</strong></li>
</ul>
<a class="button featured animate">Choose an OpenVPN Server from above</a>
</div><!-- .columns -->

<div class="one-third column">
<div class="headset price clearfix">
<img src="/images/empty.gif" alt="" class="large-icons icon-burst" />
<h4>Donate</h4>

</div>
<div class="bottom-gradient add-top add-bottom">
<span class="left"></span>
<span class="center"></span>
<span class="right"></span>
</div>

<div class="align-center" id="paypalDonate">
<iframe src="http://causera.org/donation_app/r?i=4e1bf7d46c3e3c1747fc0a588042e547" id="wiframe" style="width: 205px; height: 163px; border: 0px; overflow: hidden;" scrolling="no"></iframe>
</div>

<div class="align-center">
<br />
<strong>Bitcoin Donation</strong>
<br />
<small>1FFExjn6sm2oMZ2LJsTtn1t8uXW6EE7HQ7</small>
</div>

<div class="socialMediaContainer2">
<div class="plusoneButton"><g:plusone size="tall" href="http://www.vpnbook.com"></g:plusone></div>
<div class="facebookButton"><div id="fb-root"></div><fb:like href="http://www.vpnbook.com" send="false" layout="box_count" width="65" show_faces="false" font="arial"></fb:like></div>
<div class="twitterButton"><a href="http://twitter.com/share" rel="nofollow" class="twitter-share-button" data-url="http://www.vpnbook.com" data-text="VPNBook.com Free VPN Service" data-count="vertical">Tweet</a></div>
</div>

</div><!-- .columns -->

<div style="clear:both;"> </div>
<div class="align-center adsense">
<script type="text/javascript"><!--
google_ad_client = /* Vpnbook LB-2 */
"ca-pub-3860002410887566";
google_ad_slot = "9781879828";
google_ad_width = 728;
google_ad_height = 90;
//-->
</script>
<script type="text/javascript"
src="http://pagead2.googlesyndication.com/pagead/show_ads.js">
</script>
</div>

</div><!-- container -->
</article>

</div><!-- #main -->

<!-- share this code -->
<!-- end share this -->
<div id="share-icons">
<div id="share-mask"></div>
<div class="addthis_toolbox addthis_default_style addthis_32x32_style">
<a class="addthis_button_facebook"></a>
<a class="addthis_button_twitter"></a>
<a class="addthis_button_linkedin"></a>
<a class="addthis_button_google_plusone"></a>
<a class="addthis_button_compact"></a>
</div>
<script type="text/javascript" src="http://s7.addthis.com/js/250/addthis_widget.js#pubid=ra-4f92e19745ac6401"></script>
<!-- AddThis Button END -->

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js"></script>
<script>if(!window.jQuery) {document.write('<script src="/js/jquery-1.8.2.min.js"><\/script>');}</script>
<script src="/js/jquery.flexslider-v1.8.min.js"></script>
<script src="/js/jquery.ba-hashchange-v1.3.min.js"></script>
<script src="/js/main-r6.js?051201"></script>
<script type="text/javascript" src="http://apis.google.com/js/plusone.js"></script>
<script src="http://connect.facebook.net/en_US/all.js#appId=221296744562462&xfbml=1"></script>
<script>!function (d, s, id) { var js, fjs = d.getElementsByTagName(s)[0], p = /^http:/.test(d.location) ? 'http' : 'https'; if (!d.getElementById(id)) { js = d.createElement(s); js.id = id; js.src = p + '://platform.twitter.com/widgets.js'; fjs.parentNode.insertBefore(js, fjs); } }(document, 'script', 'twitter-wjs');</script>
</body>
</html>
What I actually need is only a small part of it:
<ul class="disc">
<li><strong>euro195.vpnbook.com</strong></li>
<li><strong>euro213.vpnbook.com</strong></li>
<li><strong>uk180.vpnbook.com</strong> <span class="red">(UK VPN - optimized for fast web surfing; no p2p downloading)</span></li>
<li><strong>us1.vpnbook.com</strong> <span class="red">(US VPN - optimized for fast web surfing; no p2p downloading)</span></li>
<li>Username: <strong>vpnbook</strong></li>
<li>Password: <strong>bRudre3u</strong></li>
</ul>
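At first I matched <ul class="disc">.*</ul>, using the following regular expression (html holds the page source, and re.S lets . match newlines as well):
m = re.search(r'<ul\sclass="disc">.*</ul>', html, re.S)
But what it matched was: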
<ul class="disc">
<li><strong>euro195.vpnbook.com</strong></li>
<li><strong>euro213.vpnbook.com</strong></li>
<li><strong>uk180.vpnbook.com</strong> <span class="red">(UK VPN - optimized for fast web surfing; no p2p downloading)</span></li>
<li><strong>us1.vpnbook.com</strong> <span class="red">(US VPN - optimized for fast web surfing; no p2p downloading)</span></li>
<li>Username: <strong>vpnbook</strong></li>
<li>Password: <strong>bRudre3u</strong></li>
</ul>
<div><strong><span class="green"> More servers coming. Please Donate.</span></strong></div>
</div><!-- .columns -->
<div class="one-third column box light featured">
<div class="headset price clearfix">
<img src="/images/empty.gif" alt="" class="large-icons icon-font" />
<h4>Free OpenVPN <br><small>(Recommended)</small></h4>
<span><sup>$</sup>0<sub>/mo</sub></span>
</div>
<div class="bottom-gradient add-top add-bottom">
<span class="left"></span>
<span class="center"></span>
<span class="right"></span>
</div>
<p>OpenVPN is the best and most recommended open-source VPN software world-wide. It is the most secure VPN option.
You need to download the open-source <a href="/howto">OpenVPN Client</a> and our configuration and certificate bundle
from the links below (use TCP if you cannot connect to UDP due to network restriction).</p>
<ul class="disc">
<li><a href="/free-openvpn-account/VPNBook.com-OpenVPN-Euro1.zip">Euro1 OpenVPN Certificate Bundle</a> </li>
<li><a href="/free-openvpn-account/VPNBook.com-OpenVPN-Euro2.zip">Euro2 OpenVPN Certificate Bundle</a> </li>
<li><a href="/free-openvpn-account/VPNBook.com-OpenVPN-UK1.zip">UK OpenVPN Certificate Bundle</a> <span class="red">(optimized for fast web surfing; no p2p downloading)</span></li>
<li><a href="/free-openvpn-account/VPNBook.com-OpenVPN-US1.zip">US OpenVPN Certificate Bundle</a> <span class="red">(optimized for fast web surfing; no p2p downloading)</span></li>
<li>All bundles include UDP53, UDP 25000, TCP 80, TCP 443 profile</li>
<li>Username: <strong>vpnbook</strong></li>
<li>Password: <strong>bRudre3u</strong></li>
</ul>
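In other words, it matched too much. What I need is everything up to the first </ul>, that is, the full contents of the first ul tag only.
So does regex offer a way to match like that? I asked our resident expert, Ma Ge, who answered: go read the docs.
In the documentation I found the greedy and lazy matching modes.
I changed the expression to
m = re.search(r'<ul\sclass="disc">.*?</ul>', html, re.S)
The matched data: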
<ul class="disc">
<li><strong>euro195.vpnbook.com</strong></li>
<li><strong>euro213.vpnbook.com</strong></li>
<li><strong>uk180.vpnbook.com</strong> <span class="red">(UK VPN - optimized for fast web surfing; no p2p downloading)</span></li>
<li><strong>us1.vpnbook.com</strong> <span class="red">(US VPN - optimized for fast web surfing; no p2p downloading)</span></li>
<li>Username: <strong>vpnbook</strong></li>
<li>Password: <strong>bRudre3u</strong></li>
</ul>
Just one extra question mark, yet it makes a world of difference.
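To see the difference without fetching the real page, here is a minimal sketch that runs both patterns against a shortened, made-up stand-in for the HTML above. The `html` string and the variable names are assumptions for illustration only, not the contents of vpnbook.py.

```python
import re

# Shortened, hypothetical stand-in for the fetched page:
# two <ul class="disc"> lists separated by other markup.
html = """<div>
<ul class="disc">
<li><strong>euro195.vpnbook.com</strong></li>
<li>Username: <strong>vpnbook</strong></li>
</ul>
<p>OpenVPN section</p>
<ul class="disc">
<li>All bundles include UDP53, UDP 25000, TCP 80, TCP 443 profile</li>
</ul>
</div>"""

pattern_greedy = r'<ul\sclass="disc">.*</ul>'
pattern_lazy = r'<ul\sclass="disc">.*?</ul>'

# re.S (DOTALL) lets "." match newlines, so a match can span lines.
greedy = re.search(pattern_greedy, html, re.S).group(0)
lazy = re.search(pattern_lazy, html, re.S).group(0)

# Greedy runs on to the LAST </ul>, so it swallows both lists.
print(greedy.count("</ul>"))  # 2
# Lazy stops at the FIRST </ul> after the opening tag.
print(lazy.count("</ul>"))    # 1

# With the lazy pattern, findall returns each list as its own match.
blocks = re.findall(pattern_lazy, html, re.S)
print(len(blocks))            # 2
```

Using `re.findall` with the lazy pattern is one way to get every list separately and then pick the one you want by index.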
Below is some background on greedy and lazy matching.
===================================================================================================
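When a regular expression contains quantifiers that accept repetition, the default behavior is to match as many characters as possible (while still letting the whole expression match). Take the expression a.*b: it matches the longest string that starts with a and ends with b. If you search aabab with it, it matches the entire string aabab. This is called greedy matching.
Sometimes we want lazy matching instead, that is, matching as few characters as possible. Any of the usual quantifiers (*, +, ?, {n,m}, {n,}) can be turned into a lazy one by appending a question mark ?. So .*? means: repeat any number of times, but use the fewest repetitions that still let the whole match succeed. Now look at the lazy version of the example:
a.*?b matches the shortest string that starts with a and ends with b, beginning at the leftmost possible position. Applied to aabab, it matches aab (characters 1 to 3) and ab (characters 4 to 5). The first match is aab rather than the shorter ab because the regex engine gives priority to the match that starts earliest; laziness only minimizes the repetitions counted from that starting point.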
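The aabab example can be checked directly in Python. This is a small sketch using only the standard `re` module; the variable names are illustrative.

```python
import re

text = "aabab"

# Greedy: .* takes as much as it can, so the match is the whole string.
greedy = re.search(r"a.*b", text).group(0)
print(greedy)    # aabab

# Lazy: .*? takes as little as it can from each starting position.
lazy_all = re.findall(r"a.*?b", text)
print(lazy_all)  # ['aab', 'ab']

# Spans are 0-based and end-exclusive: (0, 3) is characters 1 to 3,
# and (3, 5) is characters 4 to 5.
spans = [m.span() for m in re.finditer(r"a.*?b", text)]
print(spans)     # [(0, 3), (3, 5)]
```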
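| Syntax | Description |
|---|---|
| *? | Repeat any number of times, but as few times as possible |
| +? | Repeat 1 or more times, but as few times as possible |
| ?? | Repeat 0 or 1 time, but as few times as possible |
| {n,m}? | Repeat n to m times, but as few times as possible |
| {n,}? | Repeat n or more times, but as few times as possible |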
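Each row of the table can be tried against a simple input. The sketch below pairs every greedy quantifier with its lazy form and matches both at the start of the string aaaa; the `pairs` list and the other names are assumptions chosen for illustration.

```python
import re

text = "aaaa"

# (greedy pattern, lazy pattern) for each row of the table,
# using n=2 and m=3 for the counted forms.
pairs = [
    (r"a*", r"a*?"),
    (r"a+", r"a+?"),
    (r"a?", r"a??"),
    (r"a{2,3}", r"a{2,3}?"),
    (r"a{2,}", r"a{2,}?"),
]

results = {}
for greedy_pat, lazy_pat in pairs:
    greedy = re.match(greedy_pat, text).group(0)
    lazy = re.match(lazy_pat, text).group(0)
    results[lazy_pat] = (greedy, lazy)
    print(f"{greedy_pat!r:12} -> {greedy!r:8} {lazy_pat!r:12} -> {lazy!r}")
```

The lazy forms `a*?` and `a??` match the empty string here, since zero repetitions already satisfy the pattern when nothing follows the quantifier.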
Source: https://www.cnblogs.com/tk091/p/3698358.html
