How Do I Remove An XML Declaration Using BeautifulSoup4

杀马特。学长 韩版系。学妹 submitted on 2019-12-10 11:30:21

Question


I have an XHTML file that is structured like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html lang="en">
<head>
...
</head>
<body>
...
</body>
</html>

I'm using BeautifulSoup and I want to remove the XML declaration from the document, so that the result looks like this:

<!DOCTYPE html>
<html lang="en">
<head>
...
</head>
<body>
...
</body>
</html>

I can't find a way to get at the XML declaration to remove it. It doesn't appear to be a Doctype, Declaration, Tag, or NavigableString as far as I can tell. Is there a way I can find this to extract it?

As a working example, I can remove the Doctype with code like this (assuming the document text is the variable "html"):

from bs4 import BeautifulSoup, Doctype
soup = BeautifulSoup(html, 'html.parser')
for item in [i for i in soup.contents if isinstance(i, Doctype)]:
    item.extract()

Answer 1:


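With the html.parser backend, BeautifulSoup stores the XML declaration as a ProcessingInstruction node, so you could find and extract it with the following approach: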

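from bs4 import BeautifulSoup
from bs4.element import ProcessingInstruction

# html holds the document text, as in the question
soup = BeautifulSoup(html, 'html.parser')

# The XML declaration is a top-level node, so scanning soup.contents is enough
for e in soup.contents:
    if isinstance(e, ProcessingInstruction):
        e.extract()
        break  # stop at the first match; the list was just modified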
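
For reference, here is a minimal self-contained sketch of the same idea that can be run as-is. The function name strip_xml_declaration and the sample string are made up for illustration and are not part of bs4. Unlike the loop above, it removes every top-level processing instruction, not just the first. It also assumes the html.parser backend; other parsers may handle the declaration differently.

```python
from bs4 import BeautifulSoup
from bs4.element import ProcessingInstruction


def strip_xml_declaration(markup):
    """Hypothetical helper: return markup without top-level processing instructions."""
    soup = BeautifulSoup(markup, "html.parser")
    # Iterate over a copy so that extract() does not disturb the loop.
    for node in list(soup.contents):
        if isinstance(node, ProcessingInstruction):
            node.extract()
    # The newline that followed the declaration is still in the tree,
    # so trim leading whitespace from the serialized result.
    return str(soup).lstrip()


# Illustrative input modelled on the document in the question.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<!DOCTYPE html>\n"
    '<html lang="en"><head></head><body></body></html>'
)

cleaned = strip_xml_declaration(sample)
print(cleaned)
```

The lstrip() call is a design choice: extract() removes only the declaration node, so the line break after it would otherwise remain at the start of the output.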


Source: https://stackoverflow.com/questions/33207503/how-do-i-remove-an-xml-declaration-using-beautifulsoup4
