Replace `\n` in html page with space in python LXML

Submitted by 本小妞迷上赌 on 2019-12-25 17:26:06

Question


I have a messy XML document that I process with Python's lxml module. I want to replace every \n in the content with a space before doing any other processing. How can I do this for the text of all elements?

Edit: my XML example:

<root>
    <a> dsdfs\n dsf\n sdf\n</a>
    <bds> 
        <d>sdf\n\n\n\n\n\n</d>
        <d>sdf\n\n\nsdf\nsdf\n\n</d>
    </bds>
    ....
    ....
    ....
    ....
</root>

and I want to get this output when I print the result of itertext:

# root is the root element of the parsed document
for i in root.itertext():
    print(i)

dsdfs  dsf  sdf
dsdfs  dsf  sdf
sdf  nsdf sdf  

Answer 1:


The code below parses the XML, serializes it to a string, replaces every \n with a space, and writes the result to a new XML file. You can do other processing in between, depending on what exactly you want to do.

from lxml import etree

tree = etree.parse('some.xml')
root = tree.getroot()

# Get the whole XML content as a string (encoding='unicode' returns str, not bytes)
xml_in_str = etree.tostring(root, encoding='unicode')

# Replace every literal \n (a backslash followed by n) with a space
new_xml_data = xml_in_str.replace(r'\n', ' ')

# Do the processing with the new_xml_data string, which no longer contains \n

# Maybe also write it to a new XML file, without the \n
with open('newxml.xml', 'w') as f:
    f.write(new_xml_data)

some.xml looks like:

<root>
    <a> dsdfs\n dsf\n sdf\n</a>
    <bds> 
        <d>sdf\n\n\n\n\n\n</d>
        <d>sdf\n\n\nsdf\nsdf\n\n</d>
    </bds>
    <bds> 
        <d>sdf\n\n\n\n\n\n</d>
        <d>sdf\n\n\nsdf\nsdf\n\n</d>
    </bds>
    <bds> 
        <d>sdf\n\n\n\n\n\n</d>
        <d>sdf\n\n\nsdf\nsdf\n\n</d>
    </bds>
</root>

newxml.xml looks like:

<root>
    <a> dsdfs  dsf  sdf </a>
    <bds> 
        <d>sdf      </d>
        <d>sdf   sdf sdf  </d>
    </bds>
    <bds> 
        <d>sdf      </d>
        <d>sdf   sdf sdf  </d>
    </bds>
    <bds> 
        <d>sdf      </d>
        <d>sdf   sdf sdf  </d>
    </bds>
</root>
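
Note that r'\n' in the code above matches the two-character sequence backslash + n, which is what the sample file contains literally. If the file holds real line breaks instead, the pattern would presumably need to be '\n'. A small sketch of the difference, using made-up sample strings rather than the real file:

```python
# Hypothetical sample strings, not taken from a real file.
literal = r"sdf\n\nsdf"   # contains backslash + 'n' characters
actual = "sdf\n\nsdf"     # contains real newline characters

# r"\n" only matches the two-character sequence backslash + n.
literal_fixed = literal.replace(r"\n", " ")
# "\n" only matches a real newline character.
actual_fixed = actual.replace("\n", " ")

# Each pattern leaves the other kind of input untouched.
literal_unchanged = literal.replace("\n", " ")
actual_unchanged = actual.replace(r"\n", " ")

print(repr(literal_fixed))
print(repr(actual_fixed))
```

If it is unclear which form the input uses, applying both replacements one after the other covers either case.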



Answer 2:


What exactly is the code you have tried? For starters, strings are immutable, and there is no "replaceall" method in Python; str.replace returns a new string instead:

for i in root_elem.itertext():
    j = i.replace('\n',' ')
    print(j+'\n')  # or some fp.write call to a new file
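
The loop above only prints modified copies and leaves the tree unchanged. If the goal is to clean the text of all elements before further processing, one possible approach is to rewrite each element's text and tail in place. The sketch below assumes the content contains real newline characters; the helper name replace_newlines and the sample document are made up for illustration, and the ElementTree fallback is only there so the example runs without lxml installed:

```python
try:
    from lxml import etree
except ImportError:
    # Fallback so the sketch still runs without lxml installed.
    import xml.etree.ElementTree as etree


def replace_newlines(root):
    """Replace real newlines with spaces in every element's text and tail."""
    for elem in root.iter():
        if elem.text:
            elem.text = elem.text.replace("\n", " ")
        if elem.tail:
            elem.tail = elem.tail.replace("\n", " ")
    return root


# Hypothetical sample document containing real line breaks.
sample = "<root><a>dsdfs\ndsf\nsdf</a><bds><d>sdf\n\nsdf</d></bds></root>"
root = replace_newlines(etree.fromstring(sample))

# itertext() now yields text without any newline characters.
texts = list(root.itertext())
print(texts)
```

Because the elements themselves are modified, any later call to itertext() or tostring() on the same tree sees the cleaned text. For literal backslash-n sequences, the pattern would be r"\n" instead of "\n".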


Source: https://stackoverflow.com/questions/25419567/replace-n-in-html-page-with-space-in-python-lxml
