Remove consecutive duplicate words from a file using awk or sed

♀尐吖头ヾ submitted on 2020-05-23 11:40:08

Question


My input file looks like below:

“true true, rohith Rohith;
cold burn, and fact and fact good good?”

Output should look like:

"true, rohith Rohith;
cold burn, and fact and fact good?"

I am trying to do this with awk, but I have not been able to get the desired result.

awk '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s ",$i,FS)}{printf("\n")}' input.txt

Could someone please help me here.

Regards, Rohith


Answer 1:


With GNU awk for the 4th arg to split():

$ cat tst.awk
{
    n = split($0,words,/[^[:alpha:]]+/,seps)
    prev = ""
    for (i=1; i<=n; i++) {
        word = words[i]
        if (word != prev) {
            printf "%s%s", seps[i-1], word
        }
        prev = word
    }
    print ""
}

$ awk -f tst.awk file
“true, rohith Rohith;
cold burn, and fact and fact good?”
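For awk implementations without GNU awk's 4-argument split(), a roughly equivalent approach is to peel words off the line one at a time with match(). The following is a minimal sketch, not part of the original answer: the function name dedupe_words is hypothetical, and it assumes words consist of ASCII letters only ([A-Za-z]) so that it also runs on awks that lack POSIX character classes.

```shell
# Hypothetical helper: drop a word when it equals the word just before it.
dedupe_words() {
    awk '
    {
        line = $0
        out  = ""
        prev = ""
        # Repeatedly take the next run of ASCII letters from the line.
        while (match(line, /[A-Za-z]+/)) {
            sep  = substr(line, 1, RSTART - 1)      # text before the word
            word = substr(line, RSTART, RLENGTH)
            if (word != prev)
                out = out sep word                  # keep separator and word
            prev = word
            line = substr(line, RSTART + RLENGTH)   # continue after the word
        }
        print out line                              # trailing non-word text
    }'
}

printf '%s\n' 'true true, rohith Rohith;' \
              'cold burn, and fact and fact good good?' | dedupe_words
# prints:
# true, rohith Rohith;
# cold burn, and fact and fact good?
```

As in the GNU awk version, the separator in front of a dropped duplicate is discarded, and the comparison is reset at the start of every line.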



Answer 2:


Just match the same backreference in sed:

sed ':l; s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g; tl'

How it works:

  • :l - create a label l to jump to. See tl below.
  • s - substitute
    • /
    • \(^\|[^[:alpha:]]\) - match the beginning of the line or a non-alphabetic character. This ensures the next part matches a whole word, not just the suffix of one.
    • \([[:alpha:]]\{1,\}\) - match a word - one or more alphabetic characters.
    • [^[:alpha:]]\{1,\} - match a non-word - one or more non-alphabetic characters.
    • \2 - match the same thing as in the second \(...\) - ie. match the word.
    • \($\|[^[:alpha:]]\) - match the end of the line or a non-alphabetic character. This ensures we match the whole second word, not just its prefix.
    • /
    • \1\2\3 - substitute it for <beginning of the line or non-alphabetic prefix character><the word><end of the line or non-alphabetic suffix character found>
    • /
    • g - substitute globally. Because matches cannot overlap, a single pass collapses only one pair of words at a time.
  • tl - jump to label l if the last s command made a substitution. This repeats the substitution so that three or more identical words in a row, like true true true, are reduced to a single true.

Without the \(^\|[^[:alpha:]]\) and \($\|[^[:alpha:]]\) parts, true rue, for example, would be replaced by true, because rue (the suffix of true) followed by rue would match.
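The effect of the loop and of the two anchoring groups can be checked with a small experiment. This is only a sketch: the function names dedupe_anchored and dedupe_unanchored are hypothetical, the expression is written in -E (extended) syntax instead of the \(...\) form used above, and it assumes a sed that accepts back-references with -E, as GNU sed does.

```shell
# Anchored version with the retry loop (same logic as the command above).
dedupe_anchored() {
    sed -E -e ':l' \
        -e 's/(^|[^[:alpha:]])([[:alpha:]]+)[^[:alpha:]]+\2($|[^[:alpha:]])/\1\2\3/g' \
        -e 'tl'
}

# Same idea without the anchoring groups, for comparison only.
dedupe_unanchored() {
    sed -E 's/([[:alpha:]]+)[^[:alpha:]]+\1/\1/g'
}

printf '%s\n' 'true true true rue' | dedupe_anchored    # prints: true rue
printf '%s\n' 'true rue'           | dedupe_unanchored  # prints: true
```

In the first call each pass removes one duplicate, so the t command loops twice before no further match is found. In the second call the suffix rue pairs with the following rue, which is exactly the false positive the anchors prevent.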

Below are my other solutions, which also remove repeated words across lines.

My first solution used uniq. First I transform the input into pairs of the form <non-alphabetic sequence separating words, encoded in hex> <a word>. Then I run them through uniq -f1, which ignores the first field, and convert back. This will be very slow:

# recreate input
cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
# insert zero byte after each word and non-word
# the -z option is from GNU sed
sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
# for each pair (non-word, word)
xargs -0 -n2 sh -c '
    # output hexadecimal representation of non-word
    printf "%s" "$1" | xxd -p | tr -d "\n"
    # and output space with the word
    printf " %s\n" "$2"
' -- |
# uniq ignores empty fields - so make sure field1 always has something
sed 's/^/-/' |
# uniq while ignoring first field
uniq -f1 |
# for each pair (non-word in hex, word)
xargs -n2 bash -c '
    # in a POSIX shell, use: printf "%s" "$1" | sed "s/^-//" | xxd -r -p
    # change non-word from hex to characters
    printf "%s" "${1:1}" | xxd -r -p
    # output word
    printf "%s" "$2"
' --

But then I noticed that sed does a good job of tokenizing the input: it places zero bytes between the word and non-word tokens, so the stream is easy to read. I can drop repeated words by reading the zero-separated stream in GNU awk and comparing each word with the last word read:

cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
gawk -vRS='\0' '
NR%2==1{
    nonword=$0
}
NR%2==0{
    if (length(lastword) && lastword != $0) {
        printf "%s%s", lastword, nonword
    }
    lastword=$0
}
END{
    printf "%s%s", lastword, nonword
}'

Instead of the zero byte, any character that does not occur in the input could be used as the record separator, for example ^. That way it works with non-GNU awk too; tested with the mawk available on repl. I also shortened the script by using shorter variable names:

cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
sed -r 's/[[:alpha:]]+/^&^/g' |
awk -vRS='^' '
    NR%2{ n=$0 }
    NR%2-1 && length(l) && l != $0 { printf "%s%s", l, n }
    NR%2-1 { l=$0 }
    END { printf "%s%s", l, n }
'

Tested on repl. The snippets output:

true, rohith Rohith;
cold burn, and fact and fact good?



Answer 3:


Simple sed:

echo "true true, rohith Rohith;
cold burn, and fact and fact good good?" | sed -r 's/(\w+) (\1)/\1/g'
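One caveat with the expression above: the back-reference can match the beginning of a longer word, so a pair such as the theme is treated as a duplicate. A possible refinement, sketched here under the assumption of GNU sed (whose \w and \b are extensions), wraps the pair in word boundaries. The function names dedupe_naive and dedupe_bounded are hypothetical, and -E is used as the equivalent of -r.

```shell
# The expression from the answer: \1 may match a prefix of the next word.
dedupe_naive() { sed -E 's/(\w+) (\1)/\1/g'; }

# Sketch with GNU word-boundary anchors (\b) around the repeated pair.
dedupe_bounded() { sed -E 's/\b(\w+) \1\b/\1/g'; }

printf '%s\n' 'the theme'  | dedupe_naive     # prints: theme
printf '%s\n' 'the theme'  | dedupe_bounded   # prints: the theme
printf '%s\n' 'good good?' | dedupe_bounded   # prints: good?
```

Like the original, this handles a single space between the two words and does not loop, so three identical words in a row would still leave two.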



Answer 4:


This is not exactly the output you have shown, but it is close, using GNU awk:

awk -v RS='[^-_[:alnum:]]+' '$1 == p{printf "%s", RT; next} {p=$1; ORS=RT} 1' file

“true , rohith Rohith;
cold burn, and fact and fact good ?”



Answer 5:


Depending on your expected input, this might work:

sed -r 's/([a-zA-Z0-9_-]+)( *)\1/\1\2/g ; s/ ([.,;:])/\1/g ; s/  / /g' myfile

([a-zA-Z0-9_-]+) = words that might be repeated.

( *)\1 = checks whether the previous word is repeated after zero or more spaces.

s/ ([.,;:])/\1/g = removes extra spaces before punctuation (you might want to add characters to this group).

s/  / /g = replaces double spaces with a single space.

This works with GNU sed.




Answer 6:


sed -E 's/(\w+) *\1/\1/g' sample.txt

sample.txt

“true true, rohith Rohith;
cold burn, and fact and fact good good?”

output:

:~$ sed -E 's/(\w+) *\1/\1/g' sample.txt
“true, rohith Rohith;
cold burn, and fact and fact good?”

Explanation

(\w+) *\1 - matches a word followed by optional spaces and the same word again; the captured word (\1) then replaces the whole match
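Because " *" also matches zero spaces, this expression can fire inside a single word whenever a letter is doubled. The sketch below illustrates that and one possible tightening that requires at least one space and whole words on both sides. It assumes GNU sed (for \w and \b), and the function names dedupe_loose and dedupe_strict are hypothetical.

```shell
# The expression from the answer: zero spaces are allowed between the copies.
dedupe_loose() { sed -E 's/(\w+) *\1/\1/g'; }

# Sketch: require one or more spaces and word boundaries around the pair.
dedupe_strict() { sed -E 's/\b(\w+) +\1\b/\1/g'; }

printf '%s\n' 'a good book' | dedupe_loose    # prints: a god bok
printf '%s\n' 'a good book' | dedupe_strict   # prints: a good book
printf '%s\n' 'good  good?' | dedupe_strict   # prints: good?
```

The sample input happens to contain no doubled letters outside the duplicated words, which is why the output shown above looks correct.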



Source: https://stackoverflow.com/questions/59845282/remove-consecutive-duplicate-words-from-a-file-using-awk-or-sed
