How to tell BeautifulSoup to extract the content of a specific tag as text? (without touching it)

Submitted by 柔情痞子 on 2019-11-29 22:31:03

Question


I need to parse an HTML document that contains "code" tags.

I'm getting the code blocks like this:

soup = BeautifulSoup(str(content))
code_blocks = soup.findAll('code')

The problem is, if I have a code tag like this:

<code class="csharp">
    List<Person> persons = new List<Person>();
</code>

BeautifulSoup forces the nested tags closed and transforms the code block into:

<code class="csharp">
    List<person> persons = new List</person><person>();
    </person>
</code>

Is there any way to extract the content of the code tags as text with BeautifulSoup, without letting it fix what it thinks are HTML markup errors?


Answer 1:


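Add the code tag to the QUOTE_TAGS dictionary (this is BeautifulSoup 3, as the import below shows), so that the contents of the tag are treated as literal text instead of being parsed as markup.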

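from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import path

content = "<code class='csharp'>List<Person> persons = new List<Person>();</code>"

# Register <code> as a quote tag so its contents are kept as literal text
BeautifulSoup.QUOTE_TAGS['code'] = None
soup = BeautifulSoup(str(content))
code_blocks = soup.findAll('code')
print(code_blocks)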

Output:

[<code class="csharp"> List<Person> persons = new List<Person>(); </code>]
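If changing the parser's settings is not an option, one possible fallback is to pull the raw contents of the code tags out of the source string before any HTML parser touches it. The sketch below is not the QUOTE_TAGS approach and does not use BeautifulSoup at all; it is a plain regular-expression workaround using only the standard library. The names CODE_BLOCK_RE and extract_code_blocks are hypothetical, and the sketch assumes that code tags are not nested inside each other and that the snippet itself never contains a literal closing code tag.

```python
import re

# Hypothetical helper, not part of BeautifulSoup: a non-greedy match of
# everything between an opening <code ...> tag and the next </code>.
CODE_BLOCK_RE = re.compile(r"<code\b[^>]*>(.*?)</code\s*>", re.DOTALL | re.IGNORECASE)


def extract_code_blocks(html):
    """Return the raw, unparsed contents of each <code> element in html.

    Assumes <code> elements are not nested and that the code itself
    does not contain a literal "</code>".
    """
    return [match.group(1) for match in CODE_BLOCK_RE.finditer(html)]


content = "<code class='csharp'>List<Person> persons = new List<Person>();</code>"
print(extract_code_blocks(content))
```

Because nothing is parsed, the generic type arguments come back exactly as written. The rest of the document can still be handed to BeautifulSoup afterwards if the surrounding markup needs to be processed.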


Source: https://stackoverflow.com/questions/4922969/how-to-tell-beautifulsoup-to-extract-the-content-of-a-specific-tag-as-text-wit
