How to remove duplicate words from a string in a Bash script?

Submitted by 二次信任 on 2021-02-05 04:55:03

Question


I have a string containing duplicate words, for example:

abc, def, abc, def

How can I remove the duplicates? The string that I need is:

abc, def

Answer 1:


We have this test file:

$ cat file
abc, def, abc, def

To remove duplicate words:

$ sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//' file
abc, def

How it works

  • :a

    This defines a label a.

  • s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g

    This looks for a duplicated word consisting of alphanumeric characters and removes the second occurrence.

  • ta

    If the last substitution command resulted in a change, this jumps back to label a to try again.

    In this way, the code keeps looking for duplicates until none remain.

  • s/(, )+/, /g; s/, *$//

    These two substitution commands clean up any leftover comma-space runs.
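To see why those cleanup commands are needed, you can run the dedup loop on its own; assuming GNU sed, the removed words leave stray comma-space pairs behind:

```shell
# run only the dedup loop, without the two cleanup substitutions
intermediate=$(echo 'abc, def, abc, def' | sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta')
echo "$intermediate"
```

The leftover separators in `$intermediate` are exactly what `s/(, )+/, /g` and `s/, *$//` then remove.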

macOS or other BSD systems

For macOS or other BSD systems, whose sed uses -E instead of -r and requires the label and each command to be given as separate -e expressions, try:

sed -E -e ':a' -e 's/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g' -e 'ta' -e 's/(, )+/, /g' -e 's/, *$//' file

Using a string instead of a file

sed easily handles input either from a file, as shown above, or from a shell string as shown below:

$ echo 'ab, cd, cd, ab, ef' | sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//'
ab, cd, ef



Answer 2:


You can use awk to do this.

Example:

#!/bin/bash
string="abc, def, abc, def"
# Split on runs of commas/whitespace (a regex RS and the RT variable are
# GNU awk features) and print each word the first time it is seen,
# keeping its original separator.
string=$(printf '%s\n' "$string" | awk -v RS='[,[:space:]]+' '!a[$0]++{printf "%s%s", $0, RT}')
string="${string%,*}"   # strip the trailing separator
echo "$string"

Output:

abc, def
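The dedup itself is the classic awk idiom `!a[$0]++`: the counter for a record is zero, so the expression is true and the record is printed, only the first time that record appears. A minimal sketch of the idiom on plain lines:

```shell
# print each line only the first time it is seen
printf 'abc\ndef\nabc\ndef\n' | awk '!seen[$0]++'
```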



Answer 3:


This can also be done in pure Bash:

#!/bin/bash

string="abc, def, abc, def"

declare -A words

IFS=", "
for w in $string; do
  words+=( [$w]="" )
done

echo ${!words[@]}

Output

def abc

Explanation

words is an associative array (declare -A words) and every word is added as a key to it:

words+=( [${w}]="" )

(We do not need the value, so the empty string "" is used.)

The list of unique words is the list of keys (${!words[@]}).

There is one caveat though: the output is not separated by ", ". (You would have to iterate again; IFS only applies to ${words[*]}, and even then only the first character of IFS is used.)
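If the ", "-separated form is needed, one more loop over the keys can rebuild it; a minimal sketch (the key order of an associative array is unspecified, so `def, abc` is just as likely):

```shell
#!/bin/bash
string="abc, def, abc, def"
declare -A words
IFS=", "
for w in $string; do
  words[$w]=""
done

# join the unique keys with ", "
out=""
for k in "${!words[@]}"; do
  out+="${out:+, }$k"   # add the delimiter before every key except the first
done
echo "$out"
```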




Answer 4:


I have another approach for this case: I changed the input string to be space-separated, as below, and ran a pipeline to edit it:

#string="abc def abc def"
$ echo "abc def abc def" | xargs -n1 | sort -u | xargs |  sed "s# #, #g"
abc, def
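One caveat: `sort -u` reorders the words as a side effect. If first-seen order should be kept, the `sort -u` stage can be swapped for the `!seen[$0]++` awk idiom from answer 2; a sketch:

```shell
# dedup while keeping first-seen order instead of sorting
echo "def abc def abc" | xargs -n1 | awk '!seen[$0]++' | xargs | sed "s# #, #g"
# -> def, abc
```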

Thanks for all the support!




Answer 5:


The problem with the associative array or with xargs and sort in the other answers is that the words end up sorted. My solution simply skips words that have already been processed; the associative array map keeps track of them.

Bash function

function uniq_words() {

  local string="$1"
  local delimiter=", "  
  local words=""

  declare -A map

  while read -r word; do
    # skip already processed words
    if [ ! -z "${map[$word]}" ]; then
      continue
    fi

    # mark the found word
    map[$word]=1

    # don't add a delimiter, if it is the first word
    if [ -z "$words" ]; then
      words=$word
      continue
    fi

    # add a delimiter and the word
    words="$words$delimiter$word"

  # split the string into lines so that we don't have
  # to overwrite the $IFS system field separator
  # (quoting the command substitution keeps the lines intact, and \n in
  # the replacement is a GNU sed feature)
  done <<< "$(sed -e "s/$delimiter/\n/g" <<< "$string")"

  echo "${words}"
}

Example 1

uniq_words "abc, def, abc, def"

Output:

abc, def

Example 2

uniq_words "1, 2, 3, 2, 1, 0"

Output:

1, 2, 3, 0

Example with xargs and sort

In this example, the output is sorted.

echo "1 2 3 2 1 0" | xargs -n1 | sort -u | xargs |  sed "s# #, #g"

Output:

0, 1, 2, 3


Source: https://stackoverflow.com/questions/30294915/how-to-remove-duplicate-words-from-a-string-in-a-bash-script
