Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa
I've been trying to use sed to modify email addresses in a .csv but the line ab
I came here after trying the sed command s/[\x00-\x1F]/ /g; which gave me the same error message.
In this case it suffices to remove \x00 from the range, yielding s/[\x01-\x1F]/ /g;
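A minimal sketch of that narrowed range in use, assuming GNU sed (for the \xHH escapes). The function name strip_controls and the sample input are made up for illustration, and LC_ALL=C is added only so the result does not depend on the caller's locale:

```shell
# Replace every control character except NUL with a space.
# LC_ALL=C is an assumption added to keep the sketch locale-independent.
strip_controls() {
  LC_ALL=C sed 's/[\x01-\x1F]/ /g'
}

# Hypothetical input: a tab and a \001 byte embedded in the text.
result=$(printf 'a\tb\001c\n' | strip_controls)
printf '%s\n' "$result"
```

The trailing newline is not affected, because sed strips it from the pattern space before the substitution runs.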
Unfortunately, it seems that \x7F and all characters above it, plus some others, are disallowed as range endpoints, as can be seen with this short script:
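for (( i=0; i<255; i++ )); do   # stop at 254 so the upper endpoint never exceeds \d255
  # Print the decimal and hex code of the lower range endpoint.
  printf '== %d - \\x%02X ==\n' "$i" "$i"
  # sed prints an error for every range it rejects in the current locale.
  echo '' | sed -E "s/[\d$i-\d$((i+1))]//g"
done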
Note that the problem is only with using those characters as range endpoints. You can still list them all, either manually or with a script. For instance, coming back to your example:
sed -i 's/[\d128-\d255]//' FILENAME
would become
c=; for (( i=128; i<=255; i++ )); do c="$c\d$i"; done
sed -i 's/['"$c"']//' FILENAME
which would translate to:
sed -i 's/[\d128\d129\d130\d131\d132\d133\d134\d135\d136\d137\d138\d139\d140\d141\d142\d143\d144\d145\d146\d147\d148\d149\d150\d151\d152\d153\d154\d155\d156\d157\d158\d159\d160\d161\d162\d163\d164\d165\d166\d167\d168\d169\d170\d171\d172\d173\d174\d175\d176\d177\d178\d179\d180\d181\d182\d183\d184\d185\d186\d187\d188\d189\d190\d191\d192\d193\d194\d195\d196\d197\d198\d199\d200\d201\d202\d203\d204\d205\d206\d207\d208\d209\d210\d211\d212\d213\d214\d215\d216\d217\d218\d219\d220\d221\d222\d223\d224\d225\d226\d227\d228\d229\d230\d231\d232\d233\d234\d235\d236\d237\d238\d239\d240\d241\d242\d243\d244\d245\d246\d247\d248\d249\d250\d251\d252\d253\d254\d255]//' FILENAME
sed -i 's/[^[:print:]]//' FILENAME
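As a side effect, this acts like dos2unix, because a carriage return (\r) is not a printable character. Without the g flag only the first match on each line is removed, so append g to strip every non-printable character.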
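How about using awk for this? We set the field separator to the empty string, so that each character becomes its own field (this works in GNU awk; POSIX leaves the behavior unspecified). Then we loop over each character and use an if statement to check whether it matches our character class. If it does, we print it; otherwise we ignore it.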
awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.@ ]/) printf $i}'
Test:
[jaypal:~/Temp] echo "Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa" |
awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.@ ]/) printf $i}'
Chip,Dirkland,DrobSphere Inc,cdirkland@hotmail.com,usa
Update:
awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.@ ]/) printf $i; printf "\n"}' < datafile.csv > asciidata.csv
I have added printf "\n" after the loop to keep the lines separate.
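One thing to watch with the character class above: it has no digits, so an address like cdirkland42@hotmail.com would lose its "42". Here is a sketch with a slightly wider class, assuming an awk that splits on an empty FS (gawk and mawk do). The function name keep_ascii, the extra characters in the class, and the digits in the sample address are my own additions for illustration:

```shell
# Keep only letters, digits and a few punctuation characters; drop the rest.
keep_ascii() {
  awk -v FS="" '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /[A-Za-z0-9,.@ _-]/) printf "%s", $i   # "%s" avoids using data as a format
    printf "\n"
  }'
}

# Sample line with the UTF-8 bytes of æ written as octal escapes.
cleaned=$(printf 'Chip,Dirkland,Drob\303\246Sphere Inc,cdirkland42@hotmail.com,usa\n' | keep_ascii)
printf '%s\n' "$cleaned"
```

Adjust the class to whatever your data legitimately contains; anything not listed is silently dropped.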
This might work for you (GNU sed):
echo "Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa" |
sed 's/\o346/a+e/g'
Chip,Dirkland,Droba+eSphere Inc,cdirkland@hotmail.com,usa
Then do what you have to do, and afterwards revert with:
echo "Chip,Dirkland,Droba+eSphere Inc,cdirkland@hotmail.com,usa" |
sed 's/a+e/\o346/g'
Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa
If you have tricky characters in strings and want to understand how sed sees them, use the l0 command. It is also very useful for debugging difficult regexps.
echo "Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa" |
sed -n 'l0'
Chip,Dirkland,Drob\346Sphere Inc,cdirkland@hotmail.com,usa$
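The single \346 in that output means the æ is stored as one byte, which is what you get with ISO-8859-1 (Latin-1) input. In a UTF-8 file the same character is two bytes. A small sketch of the difference, assuming GNU sed (l0 is a GNU extension); the helper name show_bytes is made up, and the bytes are written as printf octal escapes so the result does not depend on your terminal encoding:

```shell
# Show each input line the way sed sees it (GNU sed: l0 means "never wrap").
# LC_ALL=C makes sed treat every byte above \177 as non-printable.
show_bytes() {
  LC_ALL=C sed -n 'l0'
}

latin1=$(printf 'Drob\346Sphere\n' | show_bytes)      # æ as one Latin-1 byte
utf8=$(printf 'Drob\303\246Sphere\n' | show_bytes)    # æ as two UTF-8 bytes

printf '%s\n' "$latin1" "$utf8"
```

If your file turns out to be the second form, it contains no \346 byte at all, so a pattern like s/\o346/a+e/g has nothing to match.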
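The issue you are having is the locale.
If you want to use a range like that, you need to change the character type (LC_CTYPE) and the collation order (LC_COLLATE).
The range fails because the bytes \x80 through \xFF are not valid characters on their own in a UTF-8 string. Note that \u0080 != \x80 in UTF-8: the code point U+0080 is encoded as the two bytes \xC2 \x80.
Anyway, to get this to work, just do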
LC_ALL=C sed -i 's/[\d128-\d255]//' FILENAME
This overrides LC_CTYPE and LC_COLLATE for that one command and does what you want.
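A quick way to try this without touching a file, as a sketch that assumes GNU sed (for the \dNNN escapes). I added the g flag so that both bytes of a UTF-8 æ are removed, and the variable names are only for illustration:

```shell
# Sample line from the question, with æ written as its two UTF-8 bytes.
line=$(printf 'Chip,Dirkland,Drob\303\246Sphere Inc,cdirkland@hotmail.com,usa')

# In the C locale every byte is a character, so the high-byte range is accepted.
stripped=$(printf '%s\n' "$line" | LC_ALL=C sed 's/[\d128-\d255]//g')
printf '%s\n' "$stripped"
```

Once the output looks right, the same expression can be used with -i on the real file.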
In this case there is also a way to simply skip over non-ASCII characters without bothering to remove them:
LANG=C sed /someemailpattern/
See https://bugzilla.redhat.com/show_bug.cgi?id=440419 and "Will sed (and others) corrupt non-ASCII files?".
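To make that concrete, here is a sketch in which the high bytes pass through untouched while the address is edited. The replacement domain example.com is hypothetical. I used LC_ALL=C instead of LANG=C because LC_ALL takes precedence if it happens to be set in your environment:

```shell
# Sample line from the question, with æ written as its two UTF-8 bytes.
line=$(printf 'Chip,Dirkland,Drob\303\246Sphere Inc,cdirkland@hotmail.com,usa')

# In the C locale sed works byte by byte, so the æ bytes are copied through as-is.
edited=$(printf '%s\n' "$line" | LC_ALL=C sed 's/@hotmail\.com/@example.com/')
printf '%s\n' "$edited"
```

Only the part of the line matched by the pattern changes; the company name keeps its æ.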