Question
I am having a hard time figuring out how to perform a regex substitution to clean up some text in a LaTeX file. The LaTeX file looks like
\chapter{\texorpdfstring{{II} {The Chapter
Title}}{II The Chapter Title}}
Annoyingly, this is a multi-line chapter declaration, and the newline can occur virtually anywhere. I can't use the common <> idiom to read the file line by line and apply a straightforward regular expression.
Instead, I am trying this:
#!/usr/bin/perl -i.old # In-place edit, backup as '.old'
use strict;
use warnings;
use Path::Tiny;
my $filename = shift or die "Usage: $0 FILENAME";
my $content = path($filename)->slurp_utf8;
$content =~ s|\\chapter\{.*\{[IVXLCDM]*\s*(.*)\}\}|\\chapter{$1}|gms;
path($filename)->spew_utf8($content);
However, the regular expression is far too greedy, and begins a match at the first \chapter declaration and ends it at the last chapter declaration. All I want is to
- remove the \texorpdfstring
- remove the roman numeral
- remove the multiple appearances of the chapter title
so that my substitution on
\chapter{\texorpdfstring{{I} {The First
Chapter}}{I The First Chapter}}
It was the best of times.
\chapter{\texorpdfstring{{II} {The Second
Chapter}}{II The Second Chapter}}
It was the worst of times.
results in
\chapter{The First Chapter}
It was the best of times.
\chapter{The Second Chapter}
It was the worst of times.
What can I do now?
Edit: I changed the demo text.
If I understood @zdim correctly, the substitution was written without escaping the braces ({ and }) to make it easier to validate. Fair enough. I tried @zdim's solution, but it output:
\chapter{The First
Chapter}
It was the worst of times.
Answer 1:
If only the pairs of {...} shown can occur:
s/\\chapter{\\texorpdfstring{{ .*? }\s*{ (.*?) }}\s*{.*?}}/\\chapter{$1}/gsx;
or
s/(\\chapter){\\texorpdfstring{{.*?}\s*{(.*?)}}\s*{.*?}}/${1}{$2}/gs;
where ${1} (instead of $1) is needed for syntax, since $1{... would be parsed as an element of the hash %1.
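As a quick check, the second form can be run against the sample text from the question. This is only a sketch: the literal braces are written as \{ and \} here, on the assumption that a newer Perl may reject or warn about an unescaped { in a pattern, and the helper name strip_texorpdfstring is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): applies the second substitution,
# with the literal braces escaped.
sub strip_texorpdfstring {
    my ($text) = @_;
    # .*? is non-greedy, so each piece stops at the nearest closing brace;
    # the s modifier lets . match the newline inside the title.
    $text =~ s/(\\chapter)\{\\texorpdfstring\{\{.*?\}\s*\{(.*?)\}\}\s*\{.*?\}\}/${1}{$2}/gs;
    return $text;
}

# Sample text copied from the question.
my $sample = <<'END';
\chapter{\texorpdfstring{{I} {The First
Chapter}}{I The First Chapter}}
It was the best of times.
\chapter{\texorpdfstring{{II} {The Second
Chapter}}{II The Second Chapter}}
It was the worst of times.
END

print strip_texorpdfstring($sample);
```

On this sample, both "It was ..." lines should survive, and the line break inside each title is kept; LaTeX treats that newline as ordinary whitespace.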
Or, rather
s/\\chapter\K{\s*\\texorpdfstring{{.*?}\s*{(.*?)}}\s*{.*?}}/{$1}/gs
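where \K, a form of lookbehind, drops everything matched before it from the match, so \chapter is left in place and need not be retyped. I still match the { and retype it in the replacement, for a possibly clearer replacement part.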
Please sprinkle this with \s* where there may be spaces.
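A sketch of the \K form with \s* added around each brace, along the lines suggested above. The input string with stray spaces and the helper name strip_with_spaces are made up for illustration, and the braces are escaped on the assumption that a newer Perl may not accept an unescaped literal { in a pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: the \K form, tolerant of whitespace around the braces.
sub strip_with_spaces {
    my ($text) = @_;
    # \K keeps "\chapter" and any space after it out of the replaced part,
    # so the replacement only has to supply the braced title.
    $text =~ s/\\chapter\s*\K\{\s*\\texorpdfstring\s*\{\s*\{.*?\}\s*\{(.*?)\}\s*\}\s*\{.*?\}\s*\}/{$1}/gs;
    return $text;
}

# Made-up input with extra spaces and a line break inside the title.
my $spaced = "\\chapter {\\texorpdfstring { {II}  {The Second\nChapter} } {II The Second Chapter} }\n";
print strip_with_spaces($spaced);
```

Whitespace between \chapter and its opening brace is left as it was, since it sits before the \K.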
Also note Path::Tiny's edit_utf8 method
path($filename)->edit_utf8( sub { s/.../.../gs } ); # regex as above
which applies the anonymous sub to the whole slurped file (held in $_), as opposed to edit_lines, which processes the file line by line.
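Put together, a complete script might look like the sketch below. It assumes a Path::Tiny recent enough to provide edit_utf8, escapes the literal braces (an assumption about what newer Perls accept), and keeps a '.old' copy because the question's script asked for one. The name clean_chapters is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Path::Tiny;

# Hypothetical wrapper: rewrites the chapter declarations of one file in place.
sub clean_chapters {
    my ($filename) = @_;
    my $file = path($filename);

    # Keep a backup next to the original, named FILENAME.old.
    $file->copy("$filename.old");

    $file->edit_utf8( sub {
        # $_ holds the whole file content; the substitution changes it in place
        # and edit_utf8 writes the result back.
        s/\\chapter\K\{\s*\\texorpdfstring\{\{.*?\}\s*\{(.*?)\}\}\s*\{.*?\}\}/{$1}/gs;
    } );
    return;
}

# Process every file named on the command line, for example:
#   perl clean_chapters.pl book.tex
clean_chapters($_) for @ARGV;
```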
If the braced expressions can be nested more freely (say with {\em ... } and such), a far more systematic approach is needed. See for example Text::Balanced and search for "nested delimiters."
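One possible direction for nested braces, shown here as a sketch only, is a recursive subpattern (available since Perl 5.10) rather than Text::Balanced. The pattern below is not from the answer: it assumes the same overall layout as the sample (numeral group, title group, plain-text copy) but lets each group contain balanced braces of its own. The helper name strip_nested and the \emph example are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: same substitution, but every {...} group may itself
# contain balanced braces.
sub strip_nested {
    my ($text) = @_;
    $text =~ s/
        \\chapter \s* \{ \s* \\texorpdfstring \s*
        \{ \s* (?&group) \s*                              # numeral group, dropped
           \{ (?<title> (?: [^{}]++ | (?&group) )* ) \}   # title, kept
        \s* \}
        \s* (?&group) \s* \}                              # plain-text copy, dropped

        (?(DEFINE)
            (?<group> \{ (?: [^{}]++ | (?&group) )* \} )  # one balanced brace group
        )
    /\\chapter{$+{title}}/gx;
    return $text;
}

# Made-up input with a nested \emph{...} inside the title.
my $nested = <<'END';
\chapter{\texorpdfstring{{II} {The \emph{Second}
Chapter}}{II The Second Chapter}}
It was the worst of times.
END

print strip_nested($nested);
```

No s modifier is needed here because the pattern uses negated character classes instead of a dot, and those already match newlines.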
Some regex resources:
- Perl documentation
  - perlretut, a tutorial
  - perlrequick, a quick-start introduction
  - perlre, the full account of syntax
  - perlreref, a quick reference (its See Also section is useful on its own)
- Stack Overflow
  - Regex info, an entry portal with resources
  - "Reference: What does this regex mean?", a gargantuan list of FAQs with links to SO posts
  - "Learning Regular expressions", an overview with a long list of resources at the end
- Regular-Expressions.info
Source: https://stackoverflow.com/questions/48509953/perl-regular-expression-for-extracting-multi-line-latex-chapter-name