Efficient way to extract text from between tags

Submitted by 元气小坏坏 on 2021-02-04 13:44:11

Question


Suppose I have something like this:

var = '<li> <a href="/...html">Energy</a>
      <ul>
      <li> <a href="/...html">Coal</a> </li>
      <li> <a href="/...html">Oil </a> </li>
      <li> <a href="/...html">Carbon</a> </li>
      <li> <a href="/...html">Oxygen</a> </li'

What is the best (most efficient) way to extract the text between the tags? Should I use a regex for this? My current technique splits the string on the li tags and uses a for loop; I am wondering whether there is a faster way to do this.


Answer 1:


You can use Beautiful Soup, which is well suited to this kind of task. It is straightforward to use, easy to install, and has extensive documentation.

Your example has some li tags that are not closed. I have corrected them below; this is how you would get the text of each link inside the li tags:

from bs4 import BeautifulSoup

var = '''<li> <a href="/...html">Energy</a></li>
    <ul>
    <li><a href="/...html">Coal</a></li>
    <li><a href="/...html">Oil </a></li>
    <li><a href="/...html">Carbon</a></li>
    <li><a href="/...html">Oxygen</a></li>'''

# Naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = BeautifulSoup(var, 'html.parser')

for a in soup.find_all('a'):
    print(a.string)

It will print:

Energy
Coal
Oil
Carbon
Oxygen

For documentation and more examples, see the Beautiful Soup documentation.




Answer 2:


The recommended way to extract information from a markup language is to use a parser; Beautiful Soup, for instance, is a good choice. Avoid regular expressions for this: they are not the right tool for the job!
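As a minimal sketch of the parser approach that needs nothing beyond the standard library, the example below subclasses `html.parser.HTMLParser` and collects the text found inside `a` tags. The names `LinkTextParser` and `extract_link_text` are made up for illustration and are not part of any library; the input is the snippet from the question, unclosed tags included.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects the text inside <a>...</a> elements (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False  # True while between <a> and </a>
        self._chunks = []        # text pieces of the current anchor
        self.texts = []          # one entry per finished anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_anchor:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            # Join the pieces and trim surrounding whitespace ("Oil " -> "Oil").
            self.texts.append("".join(self._chunks).strip())
            self._in_anchor = False


def extract_link_text(markup):
    """Return the stripped text of every <a> element in markup."""
    parser = LinkTextParser()
    parser.feed(markup)
    parser.close()
    return parser.texts


# The question's snippet, kept as posted (the final </li is not closed).
var = '''<li> <a href="/...html">Energy</a>
      <ul>
      <li> <a href="/...html">Coal</a> </li>
      <li> <a href="/...html">Oil </a> </li>
      <li> <a href="/...html">Carbon</a> </li>
      <li> <a href="/...html">Oxygen</a> </li'''

print(extract_link_text(var))
# ['Energy', 'Coal', 'Oil', 'Carbon', 'Oxygen']
```

Because the parser is event-driven, it tolerates the unclosed li tags in the input. For anything more involved than collecting link text, a tree-building parser such as Beautiful Soup is likely to be more convenient.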




Answer 3:


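If you only need what's inside the tags, try XPath, e.g. with lxml (assuming it is installed):

from lxml import html  # third-party package

tree = html.fromstring(var)  # var is the HTML string from the question
for li in tree.xpath('.//ul/li'):
    text = li.xpath('.//a/text()')  # list of text nodes inside the <a>
    print(text)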

You can also use BeautifulSoup or another HTML parser; urllib only helps with fetching the page, not with parsing it.




Answer 4:


If you want to go the regex route (which some people believe is a sin for parsing HTML/XML), you could try something like this:

import re
re.findall(r'(?<=>)([^<]+)(?=</a>[^<]*</li)', var, re.S)

Personally, I think regex is fine for one-offs or simple use-cases, but you need to be very careful in writing your regex, so as not to create patterns that can be unexpectedly greedy. For complex document parsing, it is always best to go with a module like BeautifulSoup.
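To illustrate the point about unexpectedly greedy patterns, here is a small sketch comparing three ways of capturing link text. The one-line `snippet` and its href values are made up for the demonstration; they are not taken from the question.

```python
import re

# Two anchors on one line, so a greedy pattern has room to over-match.
snippet = '<li><a href="/a.html">Coal</a></li><li><a href="/b.html">Oil</a></li>'

# Greedy: ".*" runs to the LAST "</a>", swallowing the markup in between.
greedy = re.findall(r'<a[^>]*>(.*)</a>', snippet)

# Lazy: ".*?" stops at the FIRST "</a>" after each opening tag.
lazy = re.findall(r'<a[^>]*>(.*?)</a>', snippet)

# Negated class: "[^<]+" cannot cross a tag boundary at all.
bounded = re.findall(r'<a[^>]*>([^<]+)</a>', snippet)

print(greedy)   # ['Coal</a></li><li><a href="/b.html">Oil']
print(lazy)     # ['Coal', 'Oil']
print(bounded)  # ['Coal', 'Oil']
```

The negated character class is the same idea the pattern above relies on with `[^<]+`: the match simply cannot run past the next tag, which makes the behavior easier to reason about than a lazy quantifier.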



Source: https://stackoverflow.com/questions/17181631/efficient-way-to-extract-text-from-between-tags
