Extract figures from a LaTeX file

Question

Hi, I could use a hand with the following problem. I'm trying to write a Python script that extracts the figures from a .tex file and writes them to another file. The input file looks something like this:

\documentclass[]....
\begin{document}

% More text

\begin{figure}    
figure_info 1
\end{figure}

\begin{figure}    
figure_info 2
\end{figure}    

%More text

And the output file should be something like this:

\begin{figure}    
figure_info 1
\end{figure}

\begin{figure}    
figure_info 2
\end{figure}

Thanks for the help.

Answer 1:

Thanks a lot for the answers. I finally did it this way. It probably isn't the optimal approach, but it works. I tried several of the proposed solutions, but they needed some tweaking to get them to work.

# Copy only the lines inside figure environments to the output file.
# 'with' closes both files even if an error occurs.
with open('data.tex', 'r') as infile, open('result.tex', 'w') as outfile:
    extract_block = False
    for line in infile:
        if 'begin{figure}' in line:
            extract_block = True
        if extract_block:
            outfile.write(line)
        if 'end{figure}' in line:
            extract_block = False
            # Separator between extracted figures
            outfile.write("------------------------------------------\n\n")
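
The same line-by-line idea can be wrapped in a function so that it can be reused and tried out without touching any files. This is only a sketch: the name extract_figure_blocks is hypothetical, and it assumes figures are not nested and that nothing unrelated shares a line with the \begin{figure} or \end{figure} markers.

```python
def extract_figure_blocks(lines):
    """Collect figure environments from an iterable of lines.

    Hypothetical helper: assumes figures are not nested.
    """
    blocks = []
    current = None  # None means we are currently outside a figure
    for line in lines:
        if r'\begin{figure}' in line:
            current = []
        if current is not None:
            current.append(line)
            if r'\end{figure}' in line:
                blocks.append(''.join(current))
                current = None
    return blocks


# Sample input mirroring the file from the question
sample = [
    "% More text\n",
    "\\begin{figure}\n",
    "figure_info 1\n",
    "\\end{figure}\n",
    "\n",
    "\\begin{figure}\n",
    "figure_info 2\n",
    "\\end{figure}\n",
    "%More text\n",
]
figures = extract_figure_blocks(sample)
print(len(figures))
```

Because an open file object is itself an iterable of lines, it could probably be passed in directly in place of the sample list.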

Answer 2:

You can do it with the findall() function from the regular expression (re) module. The things to note are:

- use the re.DOTALL flag to allow "." to match newlines;
- use the "lazy" operator on that dot (the question mark in ".*?"), which means the regex won't greedily run past the first \end{figure} in search of the longest possible match;
- make sure your regex string is a raw string (r'...'); otherwise you have to write every regex backslash as "\\" and every literal backslash in the pattern as "\\\\". The same goes for hard-coded input strings.

Here we go:

import re

TEXT = r"""\documentclass[]....
\begin{document}

% More text

\begin{figure}
figure_info 1
\end{figure}

\begin{figure}
figure_info 2
\end{figure}

%More text
"""

RE = r'(\\begin\{figure\}.*?\\end\{figure\})'

m = re.findall(RE, TEXT, re.DOTALL)

if m:
    for match in m:
        print(match)
        print()  # blank line
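
To see what the DOTALL flag and the lazy quantifier each contribute, the following sketch runs three variants of the pattern over the same text. The variable names (lazy, greedy, no_dotall) and the sample text are made up for illustration.

```python
import re

text = r"""\begin{figure}
figure_info 1
\end{figure}
some text in between
\begin{figure}
figure_info 2
\end{figure}
"""

# Lazy quantifier with DOTALL: stops at the first \end{figure}
lazy = re.findall(r'\\begin\{figure\}.*?\\end\{figure\}', text, re.DOTALL)

# Greedy quantifier with DOTALL: runs on to the last \end{figure}
greedy = re.findall(r'\\begin\{figure\}.*\\end\{figure\}', text, re.DOTALL)

# Lazy quantifier without DOTALL: "." cannot cross the newline after \begin{figure}
no_dotall = re.findall(r'\\begin\{figure\}.*?\\end\{figure\}', text)

print(len(lazy))       # one match per figure
print(len(greedy))     # a single match spanning both figures
print(len(no_dotall))  # nothing matches
```

The greedy match also swallows the text between the two figures, which is exactly what the question wants to avoid.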

Answer 3:

I would probably take the easy way out and read the whole file into a string variable, then split it on the figure delimiters:

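# Read the whole file into one string
with open('/tmp/workfile', 'r') as f:
    content = f.read()

# Raw strings stop "\b" from being read as a backspace escape
chunks = content.split(r"\begin{figure}")

# Drop everything before the first figure
chunks.pop(0)

for chunk in chunks:
    # Keep only the part before the closing delimiter
    body = chunk.split(r"\end{figure}")[0]
    print(r"\begin{figure}" + body + r"\end{figure}")
    print()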

You could execute this from the command line like this:

python your_script.py > output_file.tex
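
If the input path should not be hard-coded, the split-based approach could be arranged roughly as follows. This is a sketch under assumptions: extract_figures and main are hypothetical names, and the output goes to standard output so that it can still be redirected from the command line.

```python
import sys


def extract_figures(text):
    """Return the figure environments found in text, using str.split only."""
    figures = []
    # [1:] discards everything before the first \begin{figure}
    for chunk in text.split(r"\begin{figure}")[1:]:
        # Keep only the part before the closing delimiter
        body = chunk.split(r"\end{figure}")[0]
        figures.append(r"\begin{figure}" + body + r"\end{figure}")
    return figures


def main(path, out=sys.stdout):
    """Write every figure from the file at path to out, blank-line separated."""
    with open(path, "r") as f:
        for figure in extract_figures(f.read()):
            out.write(figure + "\n\n")
```

Calling main(sys.argv[1]) at the bottom of the script would let the input file be named on the command line, with the shell redirection handling the output file.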

Answer 4:

import re

# re.DOTALL makes the . wildcard match \n newlines as well, so one match
# can span several lines (re.M only changes ^ and $, so it is not needed here)
pattern = re.compile(r'\\begin\{figure\}.*?\\end\{figure\}', re.DOTALL)

# 'with' is the preferred way of opening files; it
#    ensures they are always properly closed
with open("file1.tex") as inf, open("fileout.tex","w") as outf:
    for match in pattern.findall(inf.read()):
        outf.write(match)
        outf.write("\n\n")

Edit: found the problem - not in the regex, but in the test text I was matching against (I forgot to escape the \b's in it).
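
The escaping problem mentioned in the edit is easy to reproduce. In a normal Python string literal, "\b" is the backspace character, so a hard-coded test string silently loses its backslash unless it is written as a raw string. The variable names below are only for illustration.

```python
plain = "\begin{figure}"   # "\b" is interpreted as a backspace character
raw = r"\begin{figure}"    # raw string: the backslash and "b" are kept as typed

print(plain == raw)          # False
print(len(plain), len(raw))  # 13 14
print(repr(plain[0]))        # '\x08'
```

A pattern looking for a literal backslash followed by "begin" can never match the first string, which makes the regex look broken when the test data is at fault.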

Source: https://stackoverflow.com/questions/11054008/extract-figures-from-latex-file

Tags

python

text