I have a file with a list of user-agents which are encoded. E.g.:
Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en
I want a shell script which can read this file and write to a new file with decoded strings.
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en
I have been trying to use this example to get it going but it is not working so far.
$ echo -e "$(echo "%31+%32%0A%33+%34" | sed 'y/+/ /; s/%/\\x/g')"
My script looks like:
#!/bin/bash
for f in *.log; do
echo -e "$(cat $f | sed 'y/+/ /; s/%/\x/g')" > y.log
done
Here is a simple one-line solution.
$ urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }
It may look like Perl :) but it is pure Bash: no awk, no sed, no extra processes. It uses the : builtin, special parameters, pattern substitution, and the echo builtin's -e option to translate hex codes into characters. See the bash man page for further details. You can use this function as a separate command:
$ urldecode https%3A%2F%2Fgoogle.com%2Fsearch%3Fq%3Durldecode%2Bbash
https://google.com/search?q=urldecode+bash
or in variable assignments, like so:
$ x="http%3A%2F%2Fstackoverflow.com%2Fsearch%3Fq%3Durldecode%2Bbash"
$ y=$(urldecode "$x")
$ echo "$y"
http://stackoverflow.com/search?q=urldecode+bash
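To apply the function to a whole file, as the question asks, one option is to feed it one line at a time. This is only a minimal sketch under stated assumptions: decode_file and its two arguments are hypothetical names that do not appear in the answer above, and any backslash sequences already present in the input will also be interpreted by echo -e.

```shell
#!/bin/bash
# urldecode exactly as defined in the answer above.
urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }

# decode_file (hypothetical helper): decode the file named by $1, one line
# at a time, and write the result to the file named by $2.
decode_file() {
    local src=$1 dst=$2 line
    # IFS= and -r keep leading whitespace and backslashes as read;
    # the || test also handles a last line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        urldecode "$line"
    done < "$src" > "$dst"
}
```

Calling decode_file encoded.log decoded.log (both file names are placeholders) would then write one decoded line per input line.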
GNU awk
#!/usr/bin/awk -fn
@include "ord"
BEGIN {
RS = "%.."
}
{
printf RT ? $0 chr("0x" substr(RT, 2)) : $0
}
Or
#!/bin/sh
awk -niord '{printf RT?$0chr("0x"substr(RT,2)):$0}' RS=%..
This is what seems to be working for me.
#!/bin/bash
urldecode(){
echo -e "$(sed 's/+/ /g;s/%\(..\)/\\x\1/g;')"
}
for f in /opt/logs/*.log; do
name=${f##/*/}
cat $f | urldecode > /opt/logs/processed/$HOSTNAME.$name
done
Replacing '+'s with spaces, and % signs with '\x' escapes, and letting echo interpret the \x escapes using the '-e' option was not working. For some reason, the cat command was printing the % sign as its own encoded form %25. So sed was simply replacing %25 with \x25. When the -e option was used, it was simply evaluating \x25 as % and the output was same as the original.
Trace:
Original: Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en
sed: Mozilla\x252F5.0\x2520\x2528Macintosh\x253B\x2520U\x253B\x2520Intel\x2520Mac\x2520OS\x2520X\x252010.6\x253B\x2520en
echo -e: Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en
Fix: Basically ignore the 2 characters after the % in sed.
sed: Mozilla\x2F5.0\x20\x28Macintosh\x3B\x20U\x3B\x20Intel\x20Mac\x20OS\x20X\x2010.6\x3B\x20en
echo -e: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en
I am not sure what complications this might cause under more extensive testing, but it works for now.
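One possible complication is a literal % that is not followed by two hex digits, or a backslash already present in the input. The following is a hedged sketch of a stricter variant, not the author's script: urldecode_strict is a hypothetical name, and it swaps echo -e for printf '%b', which also understands \xNN in Bash.

```shell
#!/bin/bash
# urldecode_strict (hypothetical name): decode stdin to stdout.
urldecode_strict() {
    local escaped
    # 1. '+' becomes a space.
    # 2. Existing backslashes are doubled so printf '%b' prints them literally.
    # 3. Only '%' followed by two hex digits is turned into a \xNN escape.
    escaped=$(sed 's/+/ /g; s/\\/\\\\/g; s/%\([0-9A-Fa-f][0-9A-Fa-f]\)/\\x\1/g')
    printf '%b\n' "$escaped"
}
```

Because the input passes through a command substitution, trailing newlines are dropped and a single one is added back by printf.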
With BASH, to read percent-encoded URLs from standard input and decode them:
while read; do echo -e ${REPLY//%/\\x}; done
Press CTRL-D to signal the end of file (EOF) and quit gracefully.
You can decode the contents of a file by setting the file to be standard in:
while read; do echo -e ${REPLY//%/\\x}; done < file
You can decode input from a pipe too, for example:
echo 'a%21b' | while read; do echo -e ${REPLY//%/\\x}; done
- The read builtin command reads standard input until it sees a Line Feed character. It sets a variable called REPLY equal to the line of text it just read.
- ${REPLY//%/\\x} replaces all instances of '%' with '\x'.
- echo -e interprets \xNN as the ASCII character with hexadecimal value NN.
- while repeats this loop until the read command fails, e.g. when EOF has been reached.
The above does not change '+' to ' '. To change '+' to ' ' also, like guest's answer:
while read; do : "${REPLY//%/\\x}"; echo -e ${_//+/ }; done
- : is a BASH builtin command. Here it just takes in a single argument and does nothing with it.
- The double quotes make everything inside one single parameter.
- _ is a special parameter that is equal to the last argument of the previous command, after argument expansion. This is the value of REPLY with all instances of '%' replaced with '\x'.
- ${_//+/ } replaces all instances of '+' with ' '.
This uses only BASH and doesn't start any other process, similar to guest's answer.
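The loops above leave the expansion unquoted and call read without options, so runs of whitespace collapse and backslashes are consumed by read. A minimal sketch of a more defensive form of the same loop follows. decode_lines is a hypothetical name, and the sketch uses an ordinary variable instead of the _ parameter purely for readability.

```shell
#!/bin/bash
# decode_lines (hypothetical name): decode stdin line by line, Bash only.
decode_lines() {
    local line decoded
    # IFS= keeps leading/trailing whitespace; -r stops read from eating
    # backslashes; the || test handles a final line with no newline.
    while IFS= read -r line || [ -n "$line" ]; do
        decoded=${line//+/ }                 # '+' -> ' '
        printf '%b\n' "${decoded//%/\\x}"    # '%NN' -> '\xNN', then expanded
    done
}
```

It can be used the same way as the loops above, for example decode_lines < file, or at the end of a pipe.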
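If you are a Python developer, this may be preferable (Python 3 shown here; on Python 2 the function is urllib.unquote):
echo "%21%20" | python3 -c "import sys, urllib.parse as ul; print(ul.unquote(sys.stdin.read()))"
The urllib module is built for this and handles it reliably.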
perl -pi.back -e 'y/+/ /;s/%([\da-f]{2})/pack H2,$1/gie' ./*.log
-i updates the files in place (some sed implementations have borrowed that from perl), with .back as the backup extension.
s/x/y/e substitutes x with the evaluation of the y perl code.
The perl code in this case uses pack to pack the hex number captured in $1 (first parentheses pair in the regexp) as the corresponding character.
An alternative to pack is to use chr(hex($1)):
perl -pi.back -e 'y/+/ /;s/%([\da-f]{2})/chr hex $1/gie' ./*.log
If available, you could also use uri_unescape() from URI::Escape:
perl -pi.back -MURI::Escape -e 'y/+/ /;$_=uri_unescape$_' ./*.log
Bash script for doing it in native Bash (original source):
LANG=C
urlencode() {
local l=${#1}
for (( i = 0 ; i < l ; i++ )); do
local c=${1:i:1}
case "$c" in
[a-zA-Z0-9.~_-]) printf "$c" ;;
' ') printf + ;;
*) printf '%%%.2X' "'$c"
esac
done
}
urldecode() {
local data=${1//+/ }
printf '%b' "${data//%/\x}"
}
If you want to urldecode file content, just put the file content as an argument.
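As a rough illustration of passing file content as an argument, the sketch below wraps the call in a helper. urldecode_file is a hypothetical name, the urldecode body is repeated only to keep the block self-contained, and the replacement is written as \\x, which yields the same \x inside double quotes.

```shell
#!/bin/bash
urldecode() {
    local data=${1//+/ }
    printf '%b' "${data//%/\\x}"
}

# urldecode_file (hypothetical wrapper): decode the whole file named by $1.
# The command substitution drops trailing newlines and cannot hold NUL bytes.
urldecode_file() {
    urldecode "$(cat -- "$1")"
}
```

For very large files the argument can exceed what is comfortable to hold in a shell variable, so a line-by-line approach may be the better fit there.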
Here's a test that will halt if the file content, once encoded and then decoded, differs from the original (if it keeps running for a few seconds, the script probably works correctly):
while true
do cat /dev/urandom | tr -d '\0' | head -c1000 > /tmp/tmp;
A="$(cat /tmp/tmp; printf x)"
A=${A%x}
A=$(urlencode "$A")
urldecode "$A" > /tmp/tmp2
cmp /tmp/tmp /tmp/tmp2
if [ $? != 0 ]
then break
fi
done
If you have PHP installed on your server, you can very easily decode the URL-encoded strings in any file as you "cat" or even "tail" it.
tail -f nginx.access.log | php -R 'echo urldecode($argn)."\n";'
As @barti_ddu said in the comments, \x "should be [double-]escaped".
% echo -e "$(echo "Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en" | sed 'y/+/ /; s/%/\\x/g')"
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en
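Applied to the loop from the question, the same fix might look like the sketch below. It is an illustration only: decode_logs and the .decoded suffix are hypothetical choices. It also writes one output file per log, since the original loop redirected every iteration to y.log with >, so only the last file's output would have survived.

```shell
#!/bin/bash
# decode_logs (hypothetical name): decode every *.log file in directory $1
# into a sibling file with a .decoded suffix.
decode_logs() {
    local dir=$1 f
    for f in "$dir"/*.log; do
        [ -e "$f" ] || continue   # skip the literal pattern if nothing matches
        # \\x inside the single-quoted sed script produces \x for echo -e.
        echo -e "$(sed 'y/+/ /; s/%/\\x/g' "$f")" > "$f.decoded"
    done
}
```

Running decode_logs . in the directory holding the logs would process the same set of files as the loop in the question.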
Rather than mixing up Bash and sed, I would do this all in Python. Here's a rough cut of how:
#!/usr/bin/env python
import glob
import os

try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote  # Python 2

for logfile in glob.glob(os.path.join('.', '*.log')):
    with open(logfile) as current:
        new_log_filename = logfile + '.new'
        with open(new_log_filename, 'w') as new_log_file:
            # Decode each line and write it to the .new file.
            for url in current:
                unquoted = unquote(url.strip())
                new_log_file.write(unquoted + '\n')
With GNU awk:
gawk -vRS='%[0-9a-fA-F]{2}' 'RT{sub("%","0x",RT);RT=sprintf("%c",strtonum(RT))}
{gsub(/\+/," ");printf "%s", $0 RT}'
Here is a solution that is done in pure bash where input and output are bash variables. It will decode '+' as a space and handle the '%20' space, as well as other %-encoded characters.
#!/bin/bash
#here is text that contains both '+' for spaces and a %20
text="hello+space+1%202"
decoded=$(echo -e `echo $text | sed 's/+/ /g;s/%/\\\\x/g;'`)
echo decoded=$decoded
$ uenc='H%C3%B6he %C3%BCber%20dem%20Meeresspiegel'
$ utf8=$(echo -e "${uenc//%/\\x}")
$ echo $utf8
Höhe über dem Meeresspiegel
$
Expanding on https://stackoverflow.com/a/37840948/8142470 to work with HTML entities:
$ htmldecode() { : "${*//+/ }"; echo -e "${_//&#x/\x}" | tr -d ';'; }
$ htmldecode "http://google.com/search&?q=urldecode+bash"
http://google.com/search&?q=urldecode+bash
(argument must be quoted)
Facing a similar problem, my initial idea was to use urldecode from PHP in a script that read stdin or some-such, but then I came across this idea. All the answers seem to have a lot of text, but present no real solution. The idea is sound though, and incredibly easy to get working:
$ mpc | sed -e '1! d'
http://e.org/play.php?name=/Black%20Sun%20Empire%20-%20Sideways%20%28Feat.%20Illy%20Emcee%29
$ basename "$(echo -e `mpc | sed -e '1! d' -e 's/%/\\\\x/g'`)"
Black Sun Empire - Sideways (Feat. Illy Emcee)
The key to making it work is double-escaping \x (this has been mentioned already).
Just wanted to share this other solution, pure bash:
encoded_string="Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en"
printf -v encoded_string "%b" "${encoded_string//\%/\x}"
echo $encoded_string
A slightly modified version of the Python answer that takes an input file and an output file in a one-liner (Python 3 shown here; on Python 2 the function is urllib.unquote):
cat inputfile.txt | python3 -c "import sys, urllib.parse as ul; print(ul.unquote(sys.stdin.read()))" > outputfile.txt
$ uenc='H%C3%B6he %C3%BCber%20dem%20Meeresspiegel'
$ utf8=$(printf "${uenc//%/\\x}")
$ echo $utf8
Höhe über dem Meeresspiegel
$
Source: https://stackoverflow.com/questions/6250698/how-to-decode-url-encoded-string-in-shell