Skip/remove non-ASCII characters with sed

孤街浪徒 2020-12-07 01:19

Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa

I've been trying to use sed to modify email addresses in a .csv but the line ab

6 Answers
  • 2020-12-07 01:26

    I came here trying this sed command s/[\x00-\x1F]/ /g;, which gave me the same error message.

    In this case it suffices to drop \x00 from the range, yielding s/[\x01-\x1F]/ /g;

    Unfortunately, it seems that every character from \x7F upward, plus some others, is rejected as a range endpoint, as this short script shows:

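    for (( i=0; i<255; i++ )); do
        # Print the decimal and hex code of the lower bound of the range.
        printf '== %d - \\x%02X ==\n' "$i" "$i"
        # sed prints an error when it rejects the range \d$i-\d$((i+1)).
        echo '' | sed -E "s/[\d$i-\d$((i+1))]//g"
    done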
    

    Note that the problem is only the use of those characters as the endpoints of a range. You can still list them all, either by hand or with a script. For example, to come back to your command:

    sed -i 's/[\d128-\d255]//' FILENAME
    

    would become

    c=; for (( i=128; i<=255; i++ )); do c="$c\d$i"; done
    sed -i 's/['"$c"']//' FILENAME
    

    which would translate to:

    sed -i 's/[\d128\d129\d130\d131\d132\d133\d134\d135\d136\d137\d138\d139\d140\d141\d142\d143\d144\d145\d146\d147\d148\d149\d150\d151\d152\d153\d154\d155\d156\d157\d158\d159\d160\d161\d162\d163\d164\d165\d166\d167\d168\d169\d170\d171\d172\d173\d174\d175\d176\d177\d178\d179\d180\d181\d182\d183\d184\d185\d186\d187\d188\d189\d190\d191\d192\d193\d194\d195\d196\d197\d198\d199\d200\d201\d202\d203\d204\d205\d206\d207\d208\d209\d210\d211\d212\d213\d214\d215\d216\d217\d218\d219\d220\d221\d222\d223\d224\d225\d226\d227\d228\d229\d230\d231\d232\d233\d234\d235\d236\d237\d238\d239\d240\d241\d242\d243\d244\d245\d246\d247\d248\d249\d250\d251\d252\d253\d254\d255]//' FILENAME
    
  • 2020-12-07 01:33
    sed -i 's/[^[:print:]]//g' FILENAME
    

    Because the carriage return is not a printable character, this also strips the \r from CRLF line endings, much like dos2unix.
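
    What [:print:] covers depends on the locale: in a UTF-8 locale a character such as æ is itself printable and would be kept. The following is a minimal sketch, assuming the data is UTF-8 (so æ is the two bytes \303\246) and ends in CRLF; the variable name cleaned is only illustrative. Forcing the C locale limits [:print:] to ASCII 0x20 through 0x7E.

    ```shell
    # Sample line with a UTF-8 "æ" (bytes \303\246) and a DOS line ending (\r\n).
    # In the C locale, [:print:] is only ASCII 0x20-0x7E, so both æ bytes and
    # the carriage return are deleted; sed keeps the newline itself.
    cleaned=$(printf 'Chip,Dirkland,Drob\303\246Sphere Inc,cdirkland@hotmail.com,usa\r\n' |
        LC_ALL=C sed 's/[^[:print:]]//g')

    printf '%s\n' "$cleaned"
    # Chip,Dirkland,DrobSphere Inc,cdirkland@hotmail.com,usa
    ```

    Tabs are not printable either, so this would also delete them from tab-separated data.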

  • 2020-12-07 01:36

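    How about using awk for this? Set the field separator to the empty string so that each character becomes its own field (a gawk feature; POSIX leaves an empty FS undefined). Then loop over the characters and use an if statement to check each one against the character class: if it matches, print it; otherwise skip it. Note that this whitelist also drops digits, hyphens, and underscores, so extend the bracket expression if your data contains them.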

    awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.@ ]/) printf $i}'
    

    Test:

    [jaypal:~/Temp] echo "Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa" | 
    awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.@ ]/) printf $i}'
    Chip,Dirkland,DrobSphere Inc,cdirkland@hotmail.com,usa
    

    Update:

    awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.@ ]/) printf $i; printf "\n"}' < datafile.csv > asciidata.csv
    

    I have added printf "\n" after the loop to keep the lines separate.
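
    A variant that deletes what is unwanted, rather than listing what is allowed, may be easier to maintain because digits and punctuation survive. This is only a sketch and not the method above: it assumes the C locale, where the range from space to ~ is exactly printable ASCII, and the address with a digit and an underscore is a made-up sample.

    ```shell
    # Delete every byte outside printable ASCII (space through tilde).
    # LC_ALL=C makes awk compare single bytes, so the range is by byte value.
    cleaned=$(printf 'Chip,Dirkland,Drob\303\246Sphere Inc,c_dirkland9@hotmail.com,usa\n' |
        LC_ALL=C awk '{ gsub(/[^ -~]/, ""); print }')

    printf '%s\n' "$cleaned"
    # Chip,Dirkland,DrobSphere Inc,c_dirkland9@hotmail.com,usa
    ```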

  • 2020-12-07 01:40

    This might work for you (GNU sed):

    echo "Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa" |
    sed 's/\o346/a+e/g'
    Chip,Dirkland,Droba+eSphere Inc,cdirkland@hotmail.com,usa
    

    Then do what you need to do, and afterwards revert with:

    echo "Chip,Dirkland,Droba+eSphere Inc,cdirkland@hotmail.com,usa" | 
    sed 's/a+e/\o346/g'
    Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa
    

    If you have tricky characters in strings and want to understand how sed sees them, use the l0 command. It is also very useful for debugging difficult regexps.

    echo "Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa" | 
    sed -n 'l0'
    Chip,Dirkland,Drob\346Sphere Inc,cdirkland@hotmail.com,usa$
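
    The \346 in that output is the ISO-8859-1 encoding of æ, so the \o346 substitution above appears to assume Latin-1 input; in UTF-8 the same letter is two bytes. As a rough sketch (the variable names are illustrative), the plain l command under the C locale shows the difference:

    ```shell
    # The same name encoded two ways: ISO-8859-1 (one byte, octal 346)
    # and UTF-8 (two bytes, octal 303 246). In the C locale, sed's l command
    # prints each non-ASCII byte as an octal escape and marks line end with $.
    latin1=$(printf 'Drob\346Sphere\n' | LC_ALL=C sed -n 'l')
    utf8=$(printf 'Drob\303\246Sphere\n' | LC_ALL=C sed -n 'l')

    printf '%s\n' "$latin1"   # Drob\346Sphere$
    printf '%s\n' "$utf8"     # Drob\303\246Sphere$
    ```

    If the second form is what your file contains, a pattern for æ would need to match \o303\o246 rather than \o346.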
    
  • 2020-12-07 01:46

    The issue you are having is the locale.

    If you want to use a range like that, you need to change both the character type and the collation type.

    This fails because the bytes \x80 through \xff are not valid characters on their own in a UTF-8 string. Note that \u0080 != \x80 in UTF-8.

    To get this to work, just do

    LC_ALL=C sed -i 's/[\d128-\d255]//' FILENAME
    

    This overrides LC_CTYPE and LC_COLLATE for that one command and does what you want.
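
    A quick way to check this without touching a file is to pipe the sample line through the same expression. This sketch assumes GNU sed (for the \dNNN escapes) and UTF-8 input, where æ is two bytes; it therefore adds the g flag, since without it only the first of the two bytes would be removed.

    ```shell
    # UTF-8 sample: "æ" is the two bytes \303\246, both in the range 128-255.
    # Under LC_ALL=C the range is compared byte by byte, so sed accepts it.
    cleaned=$(printf 'Chip,Dirkland,Drob\303\246Sphere Inc,cdirkland@hotmail.com,usa\n' |
        LC_ALL=C sed 's/[\d128-\d255]//g')

    printf '%s\n' "$cleaned"
    # Chip,Dirkland,DrobSphere Inc,cdirkland@hotmail.com,usa
    ```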

  • 2020-12-07 01:48

    In this case there is a way to simply skip over non-ASCII characters without bothering to remove them.

    LANG=C sed /someemailpattern/
    

    See https://bugzilla.redhat.com/show_bug.cgi?id=440419 and "Will sed (and others) corrupt non-ASCII files?".
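
    As a sketch of what skipping looks like in practice, assume the file is ISO-8859-1, so æ is the single byte \346, which is not valid UTF-8. The substitution below is hypothetical (the original edit is not shown in the question): it replaces the local part of the address with the placeholder redacted. LC_ALL=C is used instead of LANG=C because LANG is ignored whenever LC_ALL or LC_CTYPE is already set in the environment.

    ```shell
    # In the C locale every byte is a character, so ".*" can run across the
    # \346 byte and the byte is copied to the output unchanged.
    edited=$(printf 'Chip,Dirkland,Drob\346Sphere Inc,cdirkland@hotmail.com,usa\n' |
        LC_ALL=C sed 's/^\(.*\),[^,@]*@/\1,redacted@/')

    # Show the result with sed's l command so the raw byte is visible as \346.
    printf '%s\n' "$edited" | LC_ALL=C sed -n 'l'
    ```

    In a UTF-8 locale, GNU sed may refuse to match . against such an invalid byte, in which case the same expression would leave the line untouched.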
