An algorithm to find common edits

暗喜 2021-02-20 13:35

I've got two word lists, for example:

 list 1  list 2

 foot    fuut
 barj    kijo
 foio    fuau
 fuim    fuami
 kwim    kwami
 lnun    lnun
 kizm    kazm
         


        
3 answers
  •  一向 (original poster)
    2021-02-20 13:43

    My final solution was to use mosesdecoder. I split every word into single characters, treated the two lists as a parallel corpus, and counted the differing phrase pairs that Moses extracted from the character alignment. I compared Sursilvan and Vallader.

    # IRSTLM provides add-start-end.sh, build-lm.sh and compile-lm
    export IRSTLM=$HOME/rumantsch/mosesdecoder/tools/irstlm
    export PATH=$PATH:$IRSTLM/bin

    # Remove the output of any previous run
    rm -rf corpus giza.* model

    # sur = Sursilvan, val = Vallader
    array=("sur" "val")
    for i in "${array[@]}"; do
        cp "raw.$i" "splitted.$i"
        # Mark real spaces with "@", then put a space after every character
        sed -i 's/ /@/g' "splitted.$i"
        sed -i 's/./& /g' "splitted.$i"
        # Build an ARPA language model over the character tokens
        add-start-end.sh < "splitted.$i" > "compiled.$i"
        build-lm.sh -i "compiled.$i" -t ./tmp -p -o "compiled.lm.$i"
        compile-lm --text yes "compiled.lm.$i.gz" "compiled.arpa.$i"
    done

    # Run Moses training steps 1 to 5 (data preparation through phrase extraction)
    ../scripts/training/train-model.perl --first-step 1 --last-step 5 -root-dir . -corpus splitted -f sur -e val -lm 0:3:$PWD/compiled.arpa.sur -extract-options "--SentenceId" -external-bin-dir ../tools/bin/

    # Extract phrase pairs of up to 7 tokens from the symmetrized alignment
    $HOME/rumantsch/mosesdecoder/scripts/../bin/extract $HOME/rumantsch/mosesdecoder/rumantsch/splitted.val $HOME/rumantsch/mosesdecoder/rumantsch/splitted.sur $HOME/rumantsch/mosesdecoder/rumantsch/model/aligned.grow-diag-final $HOME/rumantsch/mosesdecoder/rumantsch/model/extract 7 --SentenceId --GZOutput

    # Keep pairs whose two sides differ, count them, and save the ten most frequent
    zcat model/extract.sid.gz | awk -F '[ ][|][|][|][ ]' '$1!=$2{print $1, "|", $2}' | sort | uniq -c | sort -nr | head -n 10 > results
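
To see what the two text-processing steps of the script do without installing Moses, here is a small sketch that runs them on toy input. The function names `split_chars` and `count_edits`, the entry `foo bar`, and the contents of `sample` are made up for illustration; in particular, `sample` is not real Moses output. It only assumes what the `awk` command above already relies on, namely that the first two ` ||| `-separated fields of each line hold the two sides of a phrase pair.

```shell
#!/usr/bin/env bash

split_chars() {
    # Same two substitutions as in the script: mark real spaces with "@",
    # then append a space to every character so each becomes one token.
    sed 's/ /@/g' | sed 's/./& /g'
}

count_edits() {
    # Split fields on " ||| ", keep pairs whose two sides differ,
    # then count identical pairs and list the most frequent first.
    awk -F '[ ][|][|][|][ ]' '$1!=$2{print $1, "|", $2}' \
        | sort | uniq -c | sort -nr
}

# Hypothetical stand-in for the lines of model/extract.sid.gz.
sample='o o ||| u u ||| 0-0 1-1 ||| 1
i ||| a ||| 0-0 ||| 3
i ||| a ||| 0-0 ||| 4
i ||| a ||| 0-0 ||| 5
f ||| f ||| 0-0 ||| 1'

# "foot" is from the question's list; "foo bar" is an invented two-word entry.
printf 'foot\nfoo bar\n' | split_chars
printf '%s\n' "$sample" | count_edits
```

The first command prints each word with its characters separated by spaces (note the trailing space that `s/./& /g` leaves behind), and the second lists `i | a` ahead of `o o | u u` because it occurs three times in the toy data, while the unchanged pair `f ||| f` is dropped.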
    
