问题
I have a huge XML file (about 100MB) and each line contains something along the lines of <tag>10005991</tag>
. So for example:
textextextextext<tag>10005991<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>10005993</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext
I want to replace any string between the tags and that begins with "1" to be replaced with a string of my choice and then write back to the file. I've tried using the line.replace function which works but only if I specify the string.
line=line.replace("<tag>10005991</tag>","<tag>YYYYYY</tag>")
Ideal output:
textextextextext<tag>YYYYYY<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>YYYYYY</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext
I've thought about using an array to pass each string in and then replace but I'm sure there's a much simpler solution.
回答1:
You can use the re module
>>> text = 'textextextextext<tag>10005991</tag>textextextextext'
>>> re.sub(r'<tag>1(\d+)</tag>','<tag>YYYYY</tag>',text)
'textextextextext<tag>YYYYY</tag>textextextextext'
re.sub
will replace the matched text with the second argument.
Quote from the doc
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
Usage may be like:
with open("file") as f:
for i in f:
with open("output") as f2:
f2.write(re.sub(r'<tag>1(\d+)</tag>','<tag>YYYYY</tag>',i))
回答2:
You can use regex but as you have a multi-line string you need to use re.DOTALL flag , and in your pattern you can use positive look-around for match string between tags:
>>> print re.sub(r'(?<=<tag>)1\d+(?=</?tag>)',r'YYYYYY',s,re.DOTALL,re.MULTILINE)
textextextextext<tag>YYYYYY<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>YYYYYY</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
Also as @Bhargav Rao have did in his answer you can use grouping instead look-around :
>>> print re.sub(r'<tag>(1\d+)</?tag>',r'<tag>YYYYYY</?tag>',s,re.DOTALL,re.MULTILINE)
textextextextext<tag>YYYYYY</?tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>YYYYYY</?tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext
回答3:
I think your best bet is to use ElementTree
The main idea: 1) Parse the file 2) Find the elements value 3) Test your condition 4) Replace value if condition met
Here is a good place to start parsing : How do I parse XML in Python?
来源:https://stackoverflow.com/questions/28566554/replace-string-between-tags-if-string-begins-with-1