How to separate words in a “sentence” with spaces?

。_饼干妹妹 提交于 2019-11-28 13:23:02
Sinan Ünür

Sometimes, bruteforcing is acceptable:

#!/usr/bin/perl

use strict; use warnings;
use File::Slurp;

my $dict_file = '/usr/share/dict/words';

my @identifiers = qw(
    payperiodmatchcode labordistributioncodedesc dependentrelationship
    actionendoption actionendoptiondesc addresstype addresstypedesc
    historytype psaddresstype rolename bankaccountstatus
    bankaccountstatusdesc bankaccounttype bankaccounttypedesc
    beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
    beneficiaryclass beneficiaryclassdesc benefitactioncode
    benefitactioncodedesc benefitagecontrol benefitagecontroldesc
    ageconrolagelimit ageconrolnoticeperiod
);

my @mydict = qw( desc );

my $pat = join('|',
    map quotemeta,
    sort { length $b <=> length $a || $a cmp $b }
    grep { 2 < length }
    (@mydict, map { chomp; $_ } read_file $dict_file)
);

my $re = qr/$pat/;

for my $identifier ( @identifiers ) {
    my @stack;
    print "$identifier : ";
    while ( $identifier =~ s/($re)\z// ) {
        unshift @stack, $1;
    }
    # mark suspicious cases
    unshift @stack, '*', $identifier if length $identifier;
    print "@stack\n";
}

Output:

payperiodmatchcode : pay period match code
labordistributioncodedesc : labor distribution code desc
dependentrelationship : dependent relationship
actionendoption : action end option
actionendoptiondesc : action end option desc
addresstype : address type
addresstypedesc : address type desc
historytype : history type
psaddresstype : * ps address type
rolename : role name
bankaccountstatus : bank account status
bankaccountstatusdesc : bank account status desc
bankaccounttype : bank account type
bankaccounttypedesc : bank account type desc
beneficiaryamount : beneficiary amount
beneficiaryclass : beneficiary class
beneficiarypercent : beneficiary percent
benefitsubclass : benefit subclass
beneficiaryclass : beneficiary class
beneficiaryclassdesc : beneficiary class desc
benefitactioncode : benefit action code
benefitactioncodedesc : benefit action code desc
benefitagecontrol : benefit age control
benefitagecontroldesc : benefit age control desc
ageconrolagelimit : * ageconrol age limit
ageconrolnoticeperiod : * ageconrol notice period

See also A Spellchecker Used to Be a Major Feat of Software Engineering.

I reduced your list to 32 atomic terms that I was concerned about and put them in longest-first arrangement in a regex:

use strict;
use warnings;

my $qr 
    = qr/ \G # right after last match
          ( distribution 
          | relationship 
          | beneficiary 
          | dependent 
          | subclass 
          | account
          | benefit 
          | address 
          | control 
          | history
          | percent 
          | action 
          | amount
          | conrol 
          | option 
          | period 
          | status 
          | class 
          | labor 
          | limit 
          | match 
          | notice
          | bank
          | code 
          | desc 
          | name 
          | role 
          | type 
          | age 
          | end 
          | pay
          | ps 
          )
    /x;

while ( <DATA> ) { 
    chomp;
    print;
    print ' -> ', join( ' ', m/$qr/g ), "\n";
}

__DATA__
payperiodmatchcode
labordistributioncodedesc
dependentrelationship
actionendoption
actionendoptiondesc
addresstype
addresstypedesc
historytype
psaddresstype
rolename
bankaccountstatus
bankaccountstatusdesc
bankaccounttype
bankaccounttypedesc
beneficiaryamount
beneficiaryclass
beneficiarypercent
benefitsubclass
beneficiaryclass
beneficiaryclassdesc
benefitactioncode
benefitactioncodedesc
benefitagecontrol
benefitagecontroldesc
ageconrolagelimit
ageconrolnoticeperiod

Two things occur to me:

  • this just isn't a task you can confidently attack programmatically, because ... English words don't work like that, they're often made of other words, so, is a given string "reportage" or "report age"? "Timepiece" or "time piece"?
  • One way to do attack the problem would be to use anag which finds anagrams. After all, "time piece" is an anagram of "timepiece" ... now you just have to weed out the false positives.

Here is a Lua program that tries longest matches from a dictionary:

local W={}
for w in io.lines("/usr/share/dict/words") do
    W[w]=true
end

function split(s)
    for n=#s,3,-1 do
        local w=s:sub(1,n)
        if W[w] then return w,split(s:sub(n+1)) end
    end
end

for s in io.lines() do
    print(s,"-->",split(s))
end

Given that some words could be substrings of others, especially with multiple words smashed together, I think simple solutions like regexes are out. I'd go with a full-on parser, my experience being with ANTLR. If you want to stick with perl, I've had good luck using ANTLR parsers generated as Java through Inline::Java.

Peter Norvig has a great python script that has a word segmentation function using unigram/bigram statistics. You want to take a look at the logic for the function segment2 in ngrams.py. Details are in the chapter Natural Language Corpus Data from the book Beautiful Data (Segaran and Hammerbacher, 2009). http://norvig.com/ngrams/

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!