sed: remove whole words containing a character class

Question

I'd like to remove every word that contains a non-alphabetic character from a text file, e.g.

"ok 0bad ba1d bad3 4bad4 5bad5bad5"

should become

"ok"

I've tried using

echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | sed 's/\b[a-zA-Z]*[^a-zA-Z]\+[a-zA-Z]*\b/ /g'

Answer 1:

Using awk:

s="ok 0bad ba1d bad3 4bad4 5bad5bad5"
awk '{ofs=""; for (i=1; i<=NF; i++) if ($i ~ /^[[:alpha:]]+$/)
         {printf "%s%s", ofs, $i; ofs=OFS} print ""}' <<< "$s"
ok

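This awk command loops through all fields (words) and writes a field to standard output only if it matches the regex /^[[:alpha:]]+$/, i.e. if it consists solely of alphabetic characters. The variable ofs starts out empty and is set to OFS after the first word is printed, so the separator is written only between kept words. The final print "" ends the line.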

Using grep + tr together:

s="ok 0bad ba1d bad3 4bad4 5bad5bad5"
r=$(grep -o '[^ ]\+' <<< "$s"|grep '^[[:alpha:]]\+$'|tr '\n' ' ')
echo "$r"
ok

The first grep -o breaks the string into individual words, one per line. The second grep keeps only the words that consist entirely of alphabetic characters. Finally, tr translates each \n to a space.
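One side effect of the tr step is that the result ends with a space, because the last newline is translated as well. A minimal sketch of trimming it with a parameter expansion, assuming GNU grep (for \+) and an input with a second valid word ("fine") added purely for illustration:

```shell
s="ok 0bad fine bad3"
# Same pipeline as above, fed through printf instead of a here-string.
r=$(printf '%s\n' "$s" | grep -o '[^ ]\+' | grep '^[[:alpha:]]\+$' | tr '\n' ' ')
# tr turned the final newline into a space too; drop that one trailing space.
r=${r% }
printf '[%s]\n' "$r"
```

The brackets in the last line are only there to make any leftover whitespace visible.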

Answer 2:

The following sed command does the job:

sed 's/[[:space:]]*[[:alpha:]]*[^[:space:][:alpha:]][^[:space:]]*//g'

It removes all words containing at least one non-alphabetic character. It is better to use POSIX character classes like [:alpha:] than ranges like a-zA-Z because, in a suitable locale, they won't treat the French name "François" as faulty (i.e. as containing a non-alphabetic character).

Explanation

We remove every match that starts with an arbitrary number of whitespace characters, followed by an arbitrary (possibly zero) number of alphabetic characters, followed by at least one character that is neither whitespace nor alphabetic, and then consume the rest of the word (i.e. everything up to the next whitespace character). Note that you may want to swap [:space:] for [:blank:]: [:blank:] matches only spaces and tabs, whereas [:space:] also matches newlines and other whitespace characters.

Test

$ echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | sed 's/[[:space:]]*[[:alpha:]]*[^[:space:][:alpha:]][^[:space:]]*//g'
ok
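One edge case is worth checking: since the pattern absorbs the whitespace in front of a removed word, a bad word at the very start of the line leaves the space that followed it behind. The sketch below shows this and adds a second expression to trim it; that extra expression and the sample input are assumptions for illustration, not part of the answer.

```shell
# Pattern from the answer, stored in a variable for reuse.
pat='s/[[:space:]]*[[:alpha:]]*[^[:space:][:alpha:]][^[:space:]]*//g'

# "0bad" has no preceding space to absorb, so the space before "ok" survives.
raw=$(printf '%s\n' "0bad ok 1x" | sed "$pat")

# Assumed extra step: strip leading whitespace once the words are removed.
trimmed=$(printf '%s\n' "0bad ok 1x" | sed -e "$pat" -e 's/^[[:space:]]*//')

# Brackets make the leading space visible.
printf '[%s]\n[%s]\n' "$raw" "$trimmed"
```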

Answer 3:

If you're not concerned about losing different numbers of spaces between each word, you could use something like this in Perl:

perl -ane 'print join(" ", grep { !/[^[:alpha:]]/ } @F), "\n"'

The -a switch enables auto-split mode, which splits each line on runs of whitespace and stores the fields in the array @F. grep keeps only the elements of that array that contain no non-alphabetic characters. The resulting list is joined with a single space.

Answer 4:

This might work for you (GNU sed):

sed -r 's/\b([[:alpha:]]+\b ?)|\S+\b ?/\1/g;s/ $//' file

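This uses a back reference within alternation to save the required string: a purely alphabetic word (plus an optional following space) is captured in group 1 and written back, while any other word matches the second alternative, where \1 is empty, so it is deleted. The second command removes a trailing space left at the end of the line.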

Answer 5:

st="ok 0bad ba1d bad3 4bad4 5bad5bad5"
# Word-split the string and print each purely alphabetic word on its own line.
for word in $st; do
    if [[ $word =~ ^[a-zA-Z]+$ ]]; then
        echo "$word"
    fi
done
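The loop above prints one word per line and relies on bash's [[ ... =~ ... ]]. If a single output line is wanted, as in the question, a variant along these lines might do. It is a sketch that swaps the regex test for a POSIX case glob; the variable name out and the extra sample word "fine" are assumptions for illustration.

```shell
st="ok 0bad ba1d bad3 4bad4 5bad5bad5 fine"
out=""
# Unquoted $st is word-split on purpose (the sample has no glob characters).
for word in $st; do
    case $word in
        *[!a-zA-Z]*) ;;                  # contains a non-letter: skip it
        *) out="${out:+$out }$word" ;;   # append, with a space only between words
    esac
done
printf '%s\n' "$out"
```

The ${out:+$out } expansion yields nothing while out is still empty, so no leading space is produced.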

Source: https://stackoverflow.com/questions/25158710/sed-remove-whole-words-containg-a-character-class

Tags

Linux

bash

command-line

awk

sed