Perl Regular Expression for extracting multi-line LaTeX chapter name

ε祈祈猫儿з submitted on 2019-12-13 16:53:30

Question


I am having a hard time figuring out how to perform a regex substitution to clean up some text in a LaTeX file. The LaTeX file looks like this:

\chapter{\texorpdfstring{{II} {The Chapter 
Title}}{II The Chapter Title}}

Annoyingly, this is a multi-line chapter declaration, and the line break can occur virtually anywhere. I can't use the common <> idiom to read the file line by line and apply a straightforward regular expression.

Instead, I am trying this:

#!/usr/bin/perl -i.old     # In-place edit, backup as '.old'
use strict;
use warnings;

use Path::Tiny;

my $filename = shift or die "Usage: $0 FILENAME";
my $content = path($filename)->slurp_utf8;

$content =~ s|\\chapter\{.*\{[IVXLCDM]*\s*(.*)\}\}|\\chapter{$1}|gms;
path($filename)->spew_utf8($content);

However, the regular expression is far too greedy: the match begins at the first \chapter declaration and ends at the last one. All I want is to

  1. remove the \texorpdfstring.
  2. remove the roman numeral
  3. remove the multiple appearances of the chapter title

so that my substitution on

\chapter{\texorpdfstring{{I} {The First 
Chapter}}{I The First Chapter}}

It was the best of times.

\chapter{\texorpdfstring{{II} {The Second 
Chapter}}{II The Second Chapter}}

It was the worst of times.

results in

\chapter{The First Chapter}

It was the best of times.

\chapter{The Second Chapter}

It was the worst of times.

What can I do now?

Edit: I changed the demo text.


If I understood @zdim correctly, he wrote the substitution without escaping the braces {} to make it easier to validate. Fair enough. I tried @zdim's solution, but it output:

\chapter{The First
Chapter}

It was the worst of times.

Answer 1:


If only the shown pairs of {...} can occur:

s/\\chapter{\\texorpdfstring{{ .*? }\s*{ (.*?) }}\s*{.*?}}/\\chapter{$1}/gsx;

or

s/(\\chapter){\\texorpdfstring{{.*?}\s*{(.*?)}}\s*{.*?}}/${1}{$2}/gs;

where ${1} (for $1) is needed for syntax, as $1{... would be interpreted as an element of the hash %1.
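A minimal sketch of why the braces around the capture number matter. The one-line sample string is made up for illustration, and the left braces in the pattern are escaped as \{ because, as far as I know, newer perls warn about (or in some versions reject) unescaped literal left braces; otherwise the pattern is the one above.

```perl
use strict;
use warnings;

# Hypothetical one-line sample; the real input spans several lines.
my $text = '\chapter{\texorpdfstring{{I} {The First Chapter}}{I The First Chapter}}';

# In the replacement, "${1}{" is capture group 1 followed by a literal "{".
# Written as "$1{$2}", Perl would instead parse it as a lookup in the hash %1.
(my $fixed = $text) =~
    s/(\\chapter)\{\\texorpdfstring\{\{.*?\}\s*\{(.*?)\}\}\s*\{.*?\}\}/${1}{$2}/gs;

print "$fixed\n";
```

The first capture holds \chapter and the second holds the title, so the replacement rebuilds the declaration from those two pieces.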

Or, rather

s/\\chapter\K{\s*\\texorpdfstring{{.*?}\s*{(.*?)}}\s*{.*?}}/{$1}/gs

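where \K, a form of lookbehind, drops everything matched before it from the match, so \chapter is kept as is and need not be repeated in the replacement. I still leave the { after \K, to be retyped in the replacement, for a possibly clearer replacement part.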
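A small sketch of the \K variant on a single declaration with the line break inside the title. The sample string is an assumption modelled on the question's demo text, and the left braces are escaped as \{ as a precaution for newer perls. \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical sample with a line break inside the title, as in the question.
my $text = "\\chapter{\\texorpdfstring{{II} {The Second \nChapter}}{II The Second Chapter}}\n";

# Everything matched before \K (the "\chapter" part) is left untouched;
# only what follows \K is replaced, so the replacement starts at the brace.
(my $fixed = $text) =~
    s/\\chapter\K\{\s*\\texorpdfstring\{\{.*?\}\s*\{(.*?)\}\}\s*\{.*?\}\}/{$1}/gs;

print $fixed;
```

Note that the captured title still contains its original line break; the substitution only removes the wrapper around it.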

Please sprinkle this with \s* where there may be spaces.
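One way this could look, as a sketch rather than a definitive version: the pattern is spread out with /x, allows optional whitespace between the groups, and uses /e to pass the captured title through a helper that joins it onto one line, which is what the question asks for. The helper name tidy_title is hypothetical, and the input is the question's demo text held in a string instead of read from a file.

```perl
use strict;
use warnings;

# Demo text from the question, with line breaks inside the titles.
my $content = <<'END';
\chapter{\texorpdfstring{{I} {The First
Chapter}}{I The First Chapter}}

It was the best of times.

\chapter{\texorpdfstring{{II} {The Second
Chapter}}{II The Second Chapter}}

It was the worst of times.
END

# Hypothetical helper: collapse runs of whitespace (including newlines)
# into single spaces and trim both ends.
sub tidy_title {
    my ($title) = @_;
    $title =~ s/\s+/ /g;
    $title =~ s/^ | $//g;
    return $title;
}

$content =~ s/
    \\chapter \s* \{ \s*
    \\texorpdfstring \s*
    \{ \s* \{ .*? \} \s*     # first argument: the numeral group
       \{ (.*?) \} \s* \}    # then the title group, captured
    \s* \{ .*? \} \s* \}     # second argument: the plain-text title
/ '\chapter{' . tidy_title($1) . '}' /gsex;

print $content;
```

The non-greedy .*? together with /s lets each group span a line break without running on to the next \chapter, which was the problem with the original greedy pattern.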

Also note Path::Tiny's edit_utf8 method:

path($filename)->edit_utf8( sub { s/.../.../gs } );  # regex as above

which applies the anonymous sub to the whole slurped file (available in $_), as opposed to edit_lines, which applies it to each line in turn.

If the braced expressions can be nested more freely (say with {\em ... } and such), a far more systematic approach is needed. See for example Text::Balanced and search for "nested delimiters."
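As a rough illustration of the Text::Balanced route, here is a sketch that pulls one balanced {...} group off the front of a string with extract_bracketed. The input, with an extra {\em ...} group in the title, is made up, and this shows only the extraction step, not a full rewrite of the chapter line.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical input with an extra nested group ({\em ...}) in the title.
my $text = '{\texorpdfstring{{I} {The {\em First} Chapter}}{I The First Chapter}} and the rest';

# In list context extract_bracketed returns the balanced substring found at
# the start of the text, followed by whatever remains after it.
my ($group, $rest) = extract_bracketed($text, '{}');

print "group: $group\n";
print "rest:  $rest\n";
```

Because the braces are matched by counting rather than by a fixed pattern, the extra nesting level does not need to be anticipated in a regex.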


Some regex resources

Perl documentation

  • perlretut, a tutorial

  • perlrequick, a quick-start introduction

  • perlre, the full account of syntax

  • perlreref, a quick reference (its See Also section is useful on its own)

Stack Overflow

  • Regex info: an entry portal with resources

  • Reference: What does this regex mean? (a gargantuan list of FAQs with links to SO posts)

  • Learning Regular expressions: an overview with a long list of resources at the end

Regular-Expressions.info



Source: https://stackoverflow.com/questions/48509953/perl-regular-expression-for-extracting-multi-line-latex-chapter-name
