Replace string between tags if string begins with “1”

醉酒当歌 submitted on 2021-02-05 08:23:08

Question


I have a huge XML file (about 100MB) and each line contains something along the lines of <tag>10005991</tag>. So for example:

textextextextext<tag>10005991<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>10005993</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext

I want to replace any value between the tags that begins with "1" with a string of my choice, and then write the result back to the file. I've tried the line.replace function, which works, but only if I specify the exact string:

line=line.replace("<tag>10005991</tag>","<tag>YYYYYY</tag>")

Ideal output:

textextextextext<tag>YYYYYY<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>YYYYYY</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext

I've thought about using an array to pass each string in and then replace but I'm sure there's a much simpler solution.


Answer 1:


You can use the re module

>>> import re
>>> text = 'textextextextext<tag>10005991</tag>textextextextext'
>>> re.sub(r'<tag>1(\d+)</tag>','<tag>YYYYY</tag>',text)
'textextextextext<tag>YYYYY</tag>textextextextext'

re.sub will replace the matched text with the second argument.

Quote from the doc

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
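A minimal sketch of that behaviour on two of the sample lines from the question. The names lines, pattern and results are only illustrative; re.subn is used to show how many replacements were made on each line.

```python
import re

lines = [
    'textextextextext<tag>10005991</tag>textextextextext',
    'textextextextext<tag>20005992</tag>textextextextext',
]

# Compile once so the same pattern can be reused for every line.
pattern = re.compile(r'<tag>1\d+</tag>')

# subn returns (new_string, number_of_replacements).
results = [pattern.subn('<tag>YYYYY</tag>', line) for line in lines]

for new_line, count in results:
    print(count, new_line)
```

The second line contains no value starting with "1", so it comes back unchanged with a replacement count of 0.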

Usage may be like:

import re

# Open the output once, in write mode, and stream the input line by line.
with open("file") as f, open("output", "w") as f2:
    for line in f:
        f2.write(re.sub(r'<tag>1(\d+)</tag>', '<tag>YYYYY</tag>', line))
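Since the question asks to write the result back to the same file, one possible approach is to stream into a temporary file and then swap it in. This is only a sketch under assumptions: replace_in_file and TAG_RE are hypothetical names, the file is assumed to be readable with the default text encoding, and the replacement string is assumed to contain no backslashes (re.sub would interpret them as escapes).

```python
import os
import re
import tempfile

TAG_RE = re.compile(r'<tag>1\d+</tag>')


def replace_in_file(path, replacement='<tag>YYYYY</tag>'):
    """Rewrite path in place, replacing tag values that start with '1'."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so that the final
    # os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    try:
        with os.fdopen(fd, 'w') as out, open(path) as src:
            for line in src:  # one line at a time keeps memory use low
                out.write(TAG_RE.sub(replacement, line))
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        os.remove(tmp_path)
        raise
```

Reading line by line means the 100MB file is never held in memory at once, and the original is only overwritten after the whole rewrite has succeeded.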



Answer 2:


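You can use a regex with positive look-arounds, so that only the text between the tags is matched and replaced. Here s holds the sample lines as one multi-line string. The pattern contains no '.' and no '^' or '$' anchors, so neither re.DOTALL nor re.MULTILINE changes the result, and both can be left out. Note also that the fourth positional argument of re.sub is count, not flags, so any flags should be passed by keyword (flags=...). The optional slash in </?tag> also covers the first sample line, where the closing tag is written as <tag>:

>>> print(re.sub(r'(?<=<tag>)1\d+(?=</?tag>)', 'YYYYYY', s))
textextextextext<tag>YYYYYY<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>YYYYYY</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext

For reference, the documentation describes the flag as follows:

re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.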
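One detail worth checking when calling re.sub with flags: the fourth positional parameter is count, so a flag placed there caps the number of replacements instead of changing how the pattern matches. A minimal sketch with made-up input (the 20 generated lines and the names capped and full are assumptions for illustration):

```python
import re

# Hypothetical input: 20 lines, each with a tag value starting with "1".
s = "\n".join("<tag>1%04d</tag>" % i for i in range(20))

pattern = r'(?<=<tag>)1\d+(?=</?tag>)'

# re.sub(pattern, repl, s, re.DOTALL) puts the flag in the count slot.
# re.DOTALL has the integer value 16, so that call amounts to this:
capped = re.sub(pattern, 'YYYYYY', s, count=int(re.DOTALL))

# With no count (flags, if needed, go in by keyword) every match is replaced.
full = re.sub(pattern, 'YYYYYY', s)

print(capped.count('YYYYYY'))
print(full.count('YYYYYY'))
```

On a file with more than 16 matching values, the capped variant would silently leave the later ones untouched.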

Also, as @Bhargav Rao did in his answer, you can use grouping instead of look-arounds. Capture the closing tag and reuse it with a backreference, because a literal </?tag> in the replacement string would be written to the output verbatim:

>>> print(re.sub(r'<tag>1\d+(</?tag>)', r'<tag>YYYYYY\1', s))
textextextextext<tag>YYYYYY<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>YYYYYY</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext



Answer 3:


I think your best bet is to use ElementTree

The main idea: 1) Parse the file 2) Find the element's value 3) Test your condition 4) Replace the value if the condition is met

Here is a good place to start parsing: How do I parse XML in Python?
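The four steps above can be sketched with xml.etree.ElementTree from the standard library. The document below is hypothetical (the <root> and <item> wrappers are assumptions, since the question does not show the real file's structure), and the parser requires well-formed XML, so a line with an unclosed tag such as <tag>10005991<tag> would have to be repaired first.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed document standing in for the real file.
xml_text = """<root>
  <item>textextext<tag>10005991</tag>textextext</item>
  <item>textextext<tag>20005992</tag>textextext</item>
  <item>textextext<tag>10005993</tag>textextext</item>
</root>"""

root = ET.fromstring(xml_text)                   # 1) parse
for elem in root.iter('tag'):                    # 2) find the elements
    if elem.text and elem.text.startswith('1'):  # 3) test the condition
        elem.text = 'YYYYYY'                     # 4) replace the value

result = ET.tostring(root, encoding='unicode')
print(result)
```

For a file on disk, ET.parse(path) followed by tree.write(path) plays the same role, with the caveat that the whole tree is loaded into memory, which may matter for a file of about 100MB.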



Source: https://stackoverflow.com/questions/28566554/replace-string-between-tags-if-string-begins-with-1
