An algorithm to find common edits

暗喜 2021-02-20 13:35

I've got two word lists, for example:

 list 1  list 2

 foot    fuut
 barj    kijo
 foio    fuau
 fuim    fuami
 kwim    kwami
 lnun    lnun
 kizm    kazm

3 answers

一向 (original poster)

2021-02-20 13:43

My final solution was to use mosesdecoder. I split the words into single characters, used the two character-split lists as a parallel corpus, and read the most frequent edits off the phrase pairs extracted during training. I compared Sursilvan and Vallader.
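As a minimal sketch of the character split described above, the snippet below applies the same two `sed` substitutions to a couple of words. The function name `split_chars` is hypothetical, and the extra substitution that strips the trailing space is an addition for readability, not part of the original script. It assumes a UTF-8 locale if the words contain accented characters.

```shell
# Hypothetical helper (not in the original script): read one entry per line
# and print its characters separated by spaces.
split_chars() {
    # 1) mark spaces inside an entry with "@"
    # 2) put a space after every character
    # 3) drop the trailing space (added here only for readability)
    sed -e 's/ /@/g' -e 's/./& /g' -e 's/ $//'
}

printf 'foot\nfuut\n' | split_chars
```

Each word then looks like a short "sentence" of one-character tokens (`f o o t`, `f u u t`), which is the shape a word aligner expects from a parallel corpus.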

# Make the IRSTLM tools available on PATH.
export IRSTLM=$HOME/rumantsch/mosesdecoder/tools/irstlm
export PATH=$PATH:$IRSTLM/bin

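# Start from a clean state.
rm -rf corpus giza.* model
array=("sur" "val")
for i in "${array[@]}"; do
    cp "raw.$i" "splitted.$i"
    # Mark spaces inside entries with "@", then put a space after every character.
    sed -i 's/ /@/g' "splitted.$i"
    sed -i 's/./& /g' "splitted.$i"
    # Build a character-level language model with IRSTLM and export it as ARPA text.
    add-start-end.sh < "splitted.$i" > "compiled.$i"
    build-lm.sh -i "compiled.$i" -t ./tmp -p -o "compiled.lm.$i"
    compile-lm --text yes "compiled.lm.$i.gz" "compiled.arpa.$i"
done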

# Run Moses training steps 1 to 5 (corpus preparation through phrase extraction).
../scripts/training/train-model.perl --first-step 1 --last-step 5 -root-dir . -corpus splitted -f sur -e val -lm 0:3:$PWD/compiled.arpa.sur -extract-options "--SentenceId" -external-bin-dir ../tools/bin/

# Re-run phrase extraction by hand (target side, source side, alignment, output prefix, max phrase length 7).
$HOME/rumantsch/mosesdecoder/scripts/../bin/extract $HOME/rumantsch/mosesdecoder/rumantsch/splitted.val $HOME/rumantsch/mosesdecoder/rumantsch/splitted.sur $HOME/rumantsch/mosesdecoder/rumantsch/model/aligned.grow-diag-final $HOME/rumantsch/mosesdecoder/rumantsch/model/extract 7 --SentenceId --GZOutput

# Keep the phrase pairs whose two sides differ, count them, and save the 10 most frequent.
zcat model/extract.sid.gz | awk -F '[ ][|][|][|][ ]' '$1!=$2{print $1, "|", $2}' | sort | uniq -c | sort -nr | head -n 10 > results
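To make the last pipeline easier to follow, here is a small sketch that runs the same `awk | sort | uniq -c | sort -nr` chain on a few made-up lines instead of `model/extract.sid.gz`. The sample data and the variable names `sample` and `counts` are assumptions for illustration. The only thing taken from the script above is that fields are separated by ` ||| ` and that the first two fields hold the two sides of a phrase pair; the third field here is a placeholder.

```shell
# Made-up phrase pairs in "side1 ||| side2 ||| rest" form (illustrative only).
sample='o ||| u ||| 0-0
o ||| u ||| 0-0
i ||| a ||| 0-0
f ||| f ||| 0-0
o o ||| u u ||| 0-0 1-1'

# Drop pairs whose two sides are identical, then count how often each
# remaining pair occurs and sort by count, most frequent first.
counts=$(printf '%s\n' "$sample" \
    | awk -F '[ ][|][|][|][ ]' '$1 != $2 { print $1, "|", $2 }' \
    | sort | uniq -c | sort -nr)

printf '%s\n' "$counts"
```

With this sample, the pair `o | u` comes first with a count of 2, and the unchanged pair `f | f` is filtered out. On the real data the top lines of `results` are the candidate "common edits" between the two lists.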
