How to decode URL-encoded string in shell?

Question

I have a file with a list of user-agents which are encoded. E.g.:

Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en

I want a shell script which can read this file and write to a new file with decoded strings.

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en

I have been trying to use this example to get it going but it is not working so far.

$ echo -e "$(echo "%31+%32%0A%33+%34" | sed 'y/+/ /; s/%/\\x/g')"

My script looks like:

#!/bin/bash
for f in *.log; do
  echo -e "$(cat $f | sed 'y/+/ /; s/%/\x/g')" > y.log
done

Answer 1:

Here is a simple one-line solution.

$ urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }

It may look like Perl :) but it is pure bash. No awks, no seds ... no overhead. It uses the : builtin, special parameters, pattern substitution and the echo builtin's -e option to translate hex codes into characters. See bash's manpage for further details. You can use this function as a separate command:

$ urldecode https%3A%2F%2Fgoogle.com%2Fsearch%3Fq%3Durldecode%2Bbash
https://google.com/search?q=urldecode+bash

or in variable assignments, like so:

$ x="http%3A%2F%2Fstackoverflow.com%2Fsearch%3Fq%3Durldecode%2Bbash"
$ y=$(urldecode "$x")
$ echo "$y"
http://stackoverflow.com/search?q=urldecode+bash
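
One caveat, offered as a sketch rather than a replacement: the one-liner turns every % into \x, so a literal % that is not followed by two hex digits, or a backslash already present in the input, may be altered by echo -e. The function below walks the string and decodes only well-formed %XX sequences, still in pure bash. The name urldecode_strict is hypothetical and is not part of the answer above.

```shell
#!/bin/bash
# Hypothetical helper: decode only well-formed %XX sequences.
urldecode_strict() {
    local src=${1//+/ }   # '+' means space in form-encoded data
    local out='' c hex
    local i=0
    while (( i < ${#src} )); do
        c=${src:i:1}
        hex=${src:i+1:2}
        if [[ $c == % && $hex =~ ^[0-9A-Fa-f]{2}$ ]]; then
            # Let printf's %b turn \xHH into the corresponding byte
            printf -v c '%b' "\\x$hex"
            i=$(( i + 3 ))
        else
            # Anything else, including a stray % or a backslash, is copied as-is
            i=$(( i + 1 ))
        fi
        out+=$c
    done
    printf '%s\n' "$out"
}
```

It is used the same way, for example urldecode_strict "$x". It is slower than the one-liner on long strings because it loops character by character, which is the price of leaving malformed input untouched.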

Answer 2:

GNU awk

#!/usr/bin/awk -fn
@include "ord"
BEGIN {
  RS = "%.."
}
{
  printf RT ? $0 chr("0x" substr(RT, 2)) : $0
}

#!/bin/sh
awk -niord '{printf RT?$0chr("0x"substr(RT,2)):$0}' RS=%..

Using awk printf to urldecode text

Answer 3:

If you are a Python developer, you may prefer this (Python 2 syntax):

echo "%21%20" | python -c "import sys, urllib as ul; print ul.unquote(sys.stdin.read());"

urllib handles this well.

Answer 4:

This is what seems to be working for me.

#!/bin/bash
urldecode(){
  echo -e "$(sed 's/+/ /g;s/%\(..\)/\\x\1/g;')"
}

for f in /opt/logs/*.log; do
    name=${f##/*/}
    cat $f | urldecode > /opt/logs/processed/$HOSTNAME.$name
done

Replacing '+'s with spaces, and % signs with '\x' escapes, and letting echo interpret the \x escapes using the '-e' option was not working. For some reason, the cat command was printing the % sign as its own encoded form %25. So sed was simply replacing %25 with \x25. When the -e option was used, it was simply evaluating \x25 as % and the output was same as the original.

Trace:

Original: Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en

sed: Mozilla\x252F5.0\x2520\x2528Macintosh\x253B\x2520U\x253B\x2520Intel\x2520Mac\x2520OS\x2520X\x252010.6\x253B\x2520en

echo -e: Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en

Fix: Basically ignore the 2 characters after the % in sed.

sed: Mozilla\x2F5.0\x20\x28Macintosh\x3B\x20U\x3B\x20Intel\x20Mac\x20OS\x20X\x2010.6\x3B\x20en

echo -e: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en

I'm not sure what complications might show up under extensive testing, but it works for now.
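
The capture in the sed expression above accepts any two characters after the %. A minimal sketch that restricts it to two hex digits and writes one output file per log follows. The function name decode_logs and the .decoded suffix are assumptions made for illustration and are not from the original script.

```shell
#!/bin/bash
# Hypothetical helper: decode every *.log file in a directory,
# writing NAME.decoded next to NAME.log.
decode_logs() {
    local dir=$1 f line
    for f in "$dir"/*.log; do
        [ -e "$f" ] || continue   # the glob matched nothing
        # '+' becomes a space; only % followed by two hex digits becomes \xHH
        sed 'y/+/ /; s/%\([0-9A-Fa-f][0-9A-Fa-f]\)/\\x\1/g' "$f" |
        while IFS= read -r line || [ -n "$line" ]; do
            # %b interprets the \xHH escapes; note that it also interprets
            # any backslash escapes that were already in the input line
            printf '%b\n' "$line"
        done > "${f%.log}.decoded"
    done
}
```

Calling decode_logs /opt/logs would then leave the original logs untouched. The different suffix keeps the decoded files from being picked up by the *.log glob on a later run.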

Answer 5:

With bash, to read percent-encoded URLs from standard input and decode them:

while read; do echo -e ${REPLY//%/\\x}; done

Press CTRL-D to signal the end of file (EOF) and quit gracefully.

You can decode the contents of a file by redirecting the file to standard input:

while read; do echo -e ${REPLY//%/\\x}; done < file

You can also decode input from a pipe, for example:

echo 'a%21b' | while read; do echo -e ${REPLY//%/\\x}; done

The read builtin reads standard input until it sees a line feed character. It sets a variable called REPLY to the line of text it just read.
${REPLY//%/\\x} replaces all instances of '%' with '\x'.
echo -e interprets \xNN as the character with hexadecimal value NN.
while repeats this loop until the read command fails, e.g. because EOF has been reached.

The above does not change '+' to ' '. To change '+' to ' ' also, like guest's answer:

while read; do : "${REPLY//%/\\x}"; echo -e ${_//+/ }; done

: is a BASH builtin command. Here it just takes in a single argument and does nothing with it.
The double quotes make everything inside one single parameter.
_ is a special parameter that is equal to the last argument of the previous command, after argument expansion. This is the value of REPLY with all instances of '%' replaced with '\x'.
${_//+/ } replaces all instances of '+' with ' '.

This uses only BASH and doesn't start any other process, similar to guest's answer.
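
The loops above use a bare read and unquoted expansions, so backslashes in the input are consumed by read and runs of whitespace collapse into a single space. A sketch of the same idea with IFS= read -r and quoting follows. The name decode_lines is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: decode standard input line by line, in pure bash.
decode_lines() {
    local line
    # IFS= keeps leading/trailing whitespace; -r stops read from eating backslashes;
    # the || test also handles a final line without a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        line=${line//+/ }                  # '+' to space first
        printf '%b\n' "${line//%/\\x}"     # then %NN to \xNN, expanded by %b
    done
}
```

It reads from a file or a pipe like the loops above, for example decode_lines < file. Like them, it converts every %, so a literal % that is not part of an escape is still not handled specially.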

Answer 6:

perl -pi.back -e 'y/+/ /;s/%([\da-f]{2})/pack H2,$1/gie' ./*.log

-i updates the files in place (some sed implementations have borrowed that from perl), with .back as the backup extension.

s/x/y/e substitutes x with the evaluation of the y perl code.

The perl code in this case uses pack to pack the hex number captured in $1 (first parentheses pair in the regexp) as the corresponding character.

An alternative to pack is to use chr(hex($1)):

perl -pi.back -e 'y/+/ /;s/%([\da-f]{2})/chr hex $1/gie' ./*.log

If available, you could also use uri_unescape() from URI::Escape:

perl -pi.back -MURI::Escape -e 'y/+/ /;$_=uri_unescape$_' ./*.log

Answer 7:

Bash script for doing it in native Bash (original source):

LANG=C

urlencode() {
    local l=${#1}
    for (( i = 0 ; i < l ; i++ )); do
        local c=${1:i:1}
        case "$c" in
            [a-zA-Z0-9.~_-]) printf "$c" ;;
            ' ') printf + ;;
            *) printf '%%%.2X' "'$c"
        esac
    done
}

urldecode() {
    local data=${1//+/ }
    printf '%b' "${data//%/\x}"
}

If you want to urldecode file content, just put the file content as an argument.

Here's a test that halts if decoding the encoded file content does not give back the original (if it runs for a few seconds, the script probably works correctly):

while true
  do cat /dev/urandom | tr -d '\0' | head -c1000 > /tmp/tmp;
     A="$(cat /tmp/tmp; printf x)"
     A=${A%x}
     A=$(urlencode "$A")
     urldecode "$A" > /tmp/tmp2
     cmp /tmp/tmp /tmp/tmp2
     if [ $? != 0 ]
       then break
     fi
done

Answer 8:

If you have PHP installed on your server, you can "cat" or even "tail" any file with URL-encoded strings very easily.

tail -f nginx.access.log | php -R 'echo urldecode($argn)."\n";'

Answer 9:

As @barti_ddu said in the comments, \x "should be [double-]escaped".

% echo -e "$(echo "Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en" | sed 'y/+/ /; s/%/\\x/g')"
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en

Rather than mixing up Bash and sed, I would do this all in Python. Here's a rough cut of how:

#!/usr/bin/env python

import glob
import os
import urllib

for logfile in glob.glob(os.path.join('.', '*.log')):
    with open(logfile) as current:
        new_log_filename = logfile + '.new'
        with open(new_log_filename, 'w') as new_log_file:
            for url in current:
                unquoted = urllib.unquote(url.strip())
                new_log_file.write(unquoted + '\n')

Answer 10:

With GNU awk:

gawk -vRS='%[0-9a-fA-F]{2}' 'RT{sub("%","0x",RT);RT=sprintf("%c",strtonum(RT))}
                             {gsub(/\+/," ");printf "%s", $0 RT}'

Answer 11:

Here is a solution that is done in pure bash where input and output are bash variables. It will decode '+' as a space and handle the '%20' space, as well as other %-encoded characters.

#!/bin/bash
#here is text that contains both '+' for spaces and a %20
text="hello+space+1%202"
decoded=$(echo -e `echo $text | sed 's/+/ /g;s/%/\\\\x/g;'`)
echo decoded=$decoded

Answer 12:

$ uenc='H%C3%B6he %C3%BCber%20dem%20Meeresspiegel'
$ utf8=$(echo -e "${uenc//%/\\x}")
$ echo $utf8
Höhe über dem Meeresspiegel
$

Answer 13:

Updating Jay's answer for Python 3.5+:
echo "%31+%32%0A%33+%34" | python -c "import sys; from urllib.parse import unquote ; print(unquote(sys.stdin.read()))"

Still, brendan's bash solution with explanation seems more direct and elegant.

Answer 14:

Expanding on https://stackoverflow.com/a/37840948/8142470 to work with HTML entities:

$ htmldecode() { : "${*//+/ }"; echo -e "${_//&#x/\x}" | tr -d ';'; }
$ htmldecode "http://google.com/search&?q=urldecode+bash"
http://google.com/search&?q=urldecode+bash

(argument must be quoted)

Answer 15:

Facing a similar problem, my initial idea was to use urldecode from PHP in a script that read stdin or some-such, but then I came across this idea. All the answers seem to have a lot of text, but present no real solution. The idea is sound though, and incredibly easy to get working:

$ mpc | sed -e '1! d'
http://e.org/play.php?name=/Black%20Sun%20Empire%20-%20Sideways%20%28Feat.%20Illy%20Emcee%29

$ basename "$(echo -e `mpc | sed -e '1! d' -e 's/%/\\\\x/g'`)"
Black Sun Empire - Sideways (Feat. Illy Emcee)

The key to making it work is double-escaping \x (this has been mentioned already).

Answer 16:

Just wanted to share this other solution, pure bash:

encoded_string="Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en"
printf -v encoded_string "%b" "${encoded_string//\%/\x}"
echo $encoded_string

Answer 17:

A slightly modified version of the Python answer that accepts an input and output file in a one liner.

cat inputfile.txt | python -c "import sys, urllib as ul; print ul.unquote(sys.stdin.read());" > outputfile.txt

Answer 18:

$ uenc='H%C3%B6he %C3%BCber%20dem%20Meeresspiegel'
$ utf8=$(printf "${uenc//%/\\x}")
$ echo $utf8
Höhe über dem Meeresspiegel
$

Source: https://stackoverflow.com/questions/6250698/how-to-decode-url-encoded-string-in-shell

Tags

bash

shell

awk

sed

urldecode