Why doesn't Perl v5.22 find all the sentence boundaries?

Question

This is fixed in Perl 5.22.1. I write about it in "Perl v5.22 adds fancy Unicode word boundaries".

Perl v5.22 added the Unicode assertions from TR #29. I've been playing with the sentence boundary assertion, but it only seems to find the start and end of text:

use v5.22;

$_ = "See Spot. (Spot is a dog.) See Spot run. Run Spot, run!\x{2029}New paragraph.";

while( m/\b{sb}/g ) {
    say "Sentence boundary at ", pos;
    }

The output notes sentence boundaries at the start and end of text, but not after the full stops, the sentence terminators, or the parens:

Sentence boundary at 0
Sentence boundary at 70

The Unicode breaks tester shows them mostly where I expect them based on TR #29.

I couldn't find any non-trivial tests in the perl source for this feature. I'm digesting the technical report to create appropriate test cases, but so far this looks like another untested and broken feature.

Answer 1:

Calle Dybedahl's comment gets it right (and when they turn it into an answer I'll accept that). This was a broken feature in v5.22.0 and, as far as I can tell, untested. I had an issue compiling stuff with the latest perls last night and ended the day with the question.

The perl5.22.1 perldelta does not mention the particular changes (and "mention" might be too strong since it merely alludes to possible things that were wrong without enumerating them). It mentions an incompatible change with 5.20.0 (a cut and paste error?), a "single" exception, then more than one issue. The reference to "sane" made me think that all of the changes were related to the panic issue in the next subsection. The mention of "several bugs" with only one rt.perl.org reference made me think those bugs were related to the panic issue.

=head1 Incompatible Changes

There are no changes intentionally incompatible with 5.20.0 other than the following single exception, which we deemed to be a sensible change to make in order to get the new C<\b{wb}> and (in particular) C<\b{sb}> features sane before people decided they're worthless because of bugs in their Perl 5.22.0 implementation and avoided them in the future. If any others exist, they are bugs, and we request that you submit a report. See L below.

=head2 Bounds Checking Constructs

Several bugs, including a segmentation fault, have been fixed with the bounds checking constructs (introduced in Perl 5.22) C<\b{gcb}>, C<\b{sb}>, C<\b{wb}>, C<\B{gcb}>, C<\B{sb}>, and C<\B{wb}>. All the C<\B{}> ones now match an empty string; none of the C<\b{}> ones do. L<[perl #126319]|https://rt.perl.org/Ticket/Display.html?id=126319>

Additionally, perlrebackslash, where the new boundaries are documented, doesn't mention that they don't work in v5.22.0.

I disregarded a possible fix because of incongruities in the perldelta and the prior experience I've had that new features aren't adequately (or even at all) tested in the perl source. I prematurely cut off that line of investigation and could have saved myself a couple of hours. It's certainly my fault for not getting the code running on the latest binaries, but I had become fixated on the idea that I was doing something wrong and that my code was the problem. Despite my numerous past experiences to the contrary, I wasn't entertaining thoughts (other than an update to the UCD) that perl was wrong.

Now that I'm at a different machine and have a working perl-5.22.1, I see that my program works as expected in the point release. The perldelta could have been much better here.
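A minimal sketch of how a program could refuse to run on the broken release, assuming (as described above) that v5.22.1 is the first perl with a working \b{sb}. The helper name sentence_boundaries is hypothetical and not part of any module:

```perl
use v5.22.1;   # compile-time failure on v5.22.0, where \b{sb} is broken

# Hypothetical helper: return every offset at which \b{sb} matches.
sub sentence_boundaries {
    my ($text) = @_;
    my @offsets;
    while ( $text =~ m/\b{sb}/g ) {
        push @offsets, pos($text);
    }
    return @offsets;
}

my $text = "See Spot. (Spot is a dog.) See Spot run. Run Spot, run!\x{2029}New paragraph.";
say "Sentence boundary at $_" for sentence_boundaries($text);
```

Putting the full version in the use line keeps the requirement next to the code that depends on it, so the failure shows up as a clear version error rather than as a puzzling two-line output.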

Answer 2:

I am principally to blame for this situation, but there were others involved, so I will use the plural first person in places below.

First, it is a typo that the perldelta for 5.22.1 says 5.20.0 when it means 5.22.0. It mentions just the one issue, because in our minds they were just one thing, the Unicode break boundaries.

These were added late in 5.22, and we did not realize that there were problems until after 5.22 had shipped. And when problems started showing up, some of them proved to be bugs in the Unicode-specified algorithm, and we presumed all were such.

But everything was tested and, I thought, extensively enough. Recent Unicode releases have included published tests for various features, and 5.22.0 passed all those tests. You can find them in lib/unicore/TestProp.pl, which is run every time 'make test' is done, exec'd by t/re/uniprops.t. The ones in question here are called by Test_SB() (over 500) and Test_WB() (almost 1500), and each test consists of several sub-tests. These were more tests than I would have come up with myself.
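To give a feel for what one such short test looks like, here is a rough sketch that checks a single line written in the style of Unicode's SentenceBreakTest.txt, where ÷ marks a break and × marks no break between hex code points. This is an illustration under that assumed format, not a copy of what TestProp.pl does; parse_break_test, found_breaks, and the sample line are all hypothetical:

```perl
use v5.22.1;
use utf8;    # the sample line below contains literal ÷ and × characters

# Assumed line format: "÷ 0041 × 0062 ÷ # comment"
#   ÷ = break expected here, × = no break, hex code points in between.
sub parse_break_test {
    my ($line) = @_;
    $line =~ s/\s*#.*//;                 # drop the trailing comment
    my ( $string, @expected ) = ('');
    for my $token ( split ' ', $line ) {
        if    ( $token eq '÷' ) { push @expected, length $string }
        elsif ( $token ne '×' ) { $string .= chr hex $token }
    }
    return ( $string, \@expected );
}

# Collect the offsets where perl reports a sentence boundary.
sub found_breaks {
    my ($string) = @_;
    my @found;
    push @found, pos($string) while $string =~ m/\b{sb}/g;
    return \@found;
}

# Hypothetical sample line: the two-character string "Ab".
my ( $string, $expected ) = parse_break_test('÷ 0041 × 0062 ÷ # hypothetical sample');
my $found = found_breaks($string);
say "expected: @$expected";
say "found:    @$found";
```

A line like this exercises only a handful of characters, which is the kind of short input the published tests contain.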

Independently, someone reported the segfault early in the 5.23 development process. In investigating that, I saw, through code reading, that there were other issues in the code just shipped. The interactions are complex and not easily summarized, so the perldelta did not even try. Both these boundary conditions require tracking the context in which boundaries may occur, often doing look-ahead and/or look-behind. When the code is parsing through the target string, it saves the current context for the next iteration, where it will be the look-behind context, and won't have to be recalculated. This was broken, and the context wasn't always getting saved properly. This is why the Unicode-furnished tests all passed. They were for short inputs, where the context breakage didn't matter. When this had all been fixed, I was pleasantly surprised that \b{sb} was giving results that were more what a human expected.
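Since the short inputs hid the problem, a longer input makes a more telling check. The sketch below is one possible way to do that: it repeats one short sentence N times and counts the boundaries. Going by the UAX #29 rules, I would expect N + 1 boundaries (the start of the text plus one after each sentence), but that expectation is my reading of the rules rather than something taken from the perl test suite. The helper name count_boundaries is hypothetical:

```perl
use v5.22.1;

# Hypothetical helper: count the positions where \b{sb} matches.
sub count_boundaries {
    my ($text) = @_;
    my $count = 0;
    $count++ while $text =~ m/\b{sb}/g;
    return $count;
}

# Grow the input so the matcher must carry context across many sentences.
for my $n ( 1, 10, 100 ) {
    my $text  = 'See Spot run. ' x $n;
    my $count = count_boundaries($text);
    printf "%3d sentences: %d boundaries%s\n", $n, $count,
        $count == $n + 1 ? '' : ' (unexpected)';
}
```

The single-sentence case has only the start and end boundaries, so it cannot distinguish a working matcher from one that loses its context partway through; the longer inputs can.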

The Unicode bugs are scheduled to be fixed in the next version of UAX #29, and I think we made the right decision in making \b{wb} and \b{sb} work in 5.22.1.

Source: https://stackoverflow.com/questions/36832662/why-doesnt-perl-v5-22-find-all-the-sentence-boundaries

Tags

regex

perl

unicode