How to find out if string has already been URL encoded?

Submitted by 左心房为你撑大大i on 2019-11-26 22:20:14

Decode the string and compare the result to the original. If they differ, the original was encoded; if they are identical, it was not. This still says nothing about whether the decoded version is itself still encoded, so repeat the check on the result: a good task for recursion.

I hope one can't write a quine in urlencode, or this algorithm would get stuck.
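The decode-and-compare idea can be sketched in Java as a loop rather than recursion. This is only a sketch under assumptions: the class and method names are made up for illustration, it needs Java 10+ for the `Charset` overload, and `java.net.URLDecoder` implements form decoding, so it also turns `+` into a space and therefore treats `a+b` as encoded.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

final class EncodingDepth {
    // Counts how many decode passes change the string before it becomes stable.
    static int encodingDepth(String input) {
        String current = input;
        int depth = 0;
        while (true) {
            String decoded;
            try {
                decoded = URLDecoder.decode(current, StandardCharsets.UTF_8);
            } catch (IllegalArgumentException e) {
                // Malformed escape such as a bare '%': treat as nothing left to decode.
                return depth;
            }
            if (decoded.equals(current)) {
                return depth; // decoding changed nothing, so we are done
            }
            current = decoded;
            depth++;
        }
    }

    public static void main(String[] args) {
        System.out.println(encodingDepth("hello"));   // 0
        System.out.println(encodingDepth("a%20b"));   // 1
        System.out.println(encodingDepth("a%2520b")); // 2 (double encoded)
    }
}
```

With this decoder, every pass that changes the string either shortens it or replaces a `+` with a space, so the loop should always terminate.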

Use a regexp to check whether your string contains illegal characters (i.e. characters that cannot appear in a URL-encoded string, such as whitespace).
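A minimal Java sketch of such a check, assuming the string is a single URL component (not a whole URL) and that "legal" means the RFC 3986 unreserved characters plus well-formed `%XX` escapes. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class UrlEncodedShape {
    // Unreserved characters (letters, digits, '-', '.', '_', '~') or a %XX escape.
    private static final Pattern ENCODED =
            Pattern.compile("(?:[A-Za-z0-9\\-._~]|%[0-9A-Fa-f]{2})*");

    // True if the string contains nothing that an encoded component could not contain.
    static boolean looksEncoded(String s) {
        return ENCODED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksEncoded("a%20b")); // true
        System.out.println(looksEncoded("a b"));   // false: raw whitespace
        System.out.println(looksEncoded("100%"));  // false: '%' without two hex digits
    }
}
```

Note that a plain string such as `abc` also passes, so a positive result only means "nothing here contradicts being encoded".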

Joel on Software had a solution for this some time back: http://www.joelonsoftware.com/articles/Wrong.html
Alternatively, you may add a prefix to the strings to mark them as encoded.

Try decoding the URL. If the resulting string is shorter than the original, the original URL was already encoded; otherwise you can safely encode it (either it is not encoded, or encoding leaves it unchanged, so encoding it again will not produce a wrong URL). Below is sample pseudocode (inspired by Ruby):

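    require 'uri'

    # URI.escape and URI.unescape were removed in Ruby 3.0;
    # the RFC 2396 parser still provides the same methods.
    PARSER = URI::RFC2396_Parser.new

    # Returns the encoded URL for any given URL after determining whether it is already encoded or not
    def escape(url)
      unescaped_url = PARSER.unescape(url)
      if unescaped_url.length < url.length
        url                 # decoding shortened it, so it was already encoded
      else
        PARSER.escape(url)  # not encoded yet (or encoding leaves it unchanged)
      end
    end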

You can't know for sure, unless your strings conform to a certain pattern or you keep track of which strings you have encoded. As you noted yourself, a string that is already encoded can be encoded again, so you can't be 100% sure just by looking at the string itself.
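One way to "keep track of your strings" is to carry the encoded state in the type instead of guessing it from the content. The wrapper below is a hypothetical sketch, not an established API; it assumes Java 10+ and uses `java.net.URLEncoder`, which applies form encoding (a space becomes `+`).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper: holding an EncodedText means the value is already encoded.
final class EncodedText {
    private final String value;

    private EncodedText(String value) {
        this.value = value;
    }

    // Encodes a raw string exactly once.
    static EncodedText fromRaw(String raw) {
        return new EncodedText(URLEncoder.encode(raw, StandardCharsets.UTF_8));
    }

    // Wraps a string the caller knows to be encoded, without touching it.
    static EncodedText alreadyEncoded(String encoded) {
        return new EncodedText(encoded);
    }

    String value() {
        return value;
    }

    public static void main(String[] args) {
        System.out.println(fromRaw("a&b").value());          // a%26b
        System.out.println(alreadyEncoded("a%26b").value()); // a%26b
    }
}
```

Code that accepts only `EncodedText` never has to ask whether its input was encoded.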

Check your URL for suspicious characters[1]. List of candidates:

WHITE_SPACE, ", <, >, {, }, |, \, ^, ~, [, ], . and `

I use:

private static boolean isAlreadyEncoded(String passedUrl) {
    // Treat the URL as not yet encoded if it contains any unsafe character:
    // space " < > { } | \ ^ ~ [ ]
    return !passedUrl.matches(".*[\\ \"\\<\\>\\{\\}|\\\\^~\\[\\]].*");
}

For the actual encoding I proceed with:

https://stackoverflow.com/a/49796882/1485527

Note: Even if your URL doesn't contain unsafe characters, you might still want to apply, e.g., Punycode encoding to the host name. So there is still plenty of room for additional checks.
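For the host name part, the JDK's `java.net.IDN` class can do the Punycode (IDNA) conversion. A small sketch, with a made-up class name and example host:

```java
import java.net.IDN;

final class HostPunycode {
    // Converts an internationalized host name to its ASCII (Punycode) form.
    static String toAsciiHost(String host) {
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        // "\u00fc" is the letter u with umlaut, written as an escape to avoid source-encoding issues.
        System.out.println(toAsciiHost("b\u00fccher.example")); // xn--bcher-kva.example
    }
}
```

This applies only to the host; the path and query still need percent-encoding.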


[1] A list of candidates can be found in the section "unsafe" of the URL spec at Page 2. In my understanding '%' or '#' should be left out in the encoding check, since these characters can occur in encoded URLs as well.

If you want to be sure that the string is encoded correctly (if it needs to be encoded at all), just decode it and encode it once again.

metacode:

100%_correctly_encoded_string = encode(decode(input_string))

An already encoded string will remain untouched. An unencoded string will be encoded. A string with only URL-allowed characters will remain untouched too.
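In Java, the metacode could look roughly like the sketch below. It assumes Java 10+ and a single URL component as input (applied to a whole URL it would also encode `:` and `/`). The class name is illustrative. `URLEncoder`/`URLDecoder` use form encoding, so a space comes out as `+` rather than `%20`.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class Reencode {
    // encode(decode(input)): decoding first makes a second encoding pass harmless.
    static String normalize(String input) {
        String decoded = URLDecoder.decode(input, StandardCharsets.UTF_8);
        return URLEncoder.encode(decoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(normalize("a%2Fb")); // a%2Fb (already encoded, unchanged)
        System.out.println(normalize("a/b"));   // a%2Fb (unencoded, now encoded)
        System.out.println(normalize("abc"));   // abc   (only allowed characters)
    }
}
```

One caveat: an unencoded input with a literal `%` that is not followed by two hex digits makes `URLDecoder.decode` throw an `IllegalArgumentException`, and a literal `%` that happens to be followed by two hex digits is silently decoded.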

According to the spec (https://tools.ietf.org/html/rfc3986), all URLs MUST start with a scheme followed by a colon (:).

Since a colon is required as the delimiter between the scheme and the rest of the URI, any string that contains a colon is not encoded (encoding would have turned the colon into %3A).

(This assumes you will not be given an incomplete URI with no scheme.)

So you can test whether the string contains a colon. If it does, it is not encoded. If it does not, urldecode it: if the decoded string contains a colon, the original string was URL encoded. If it still does not, check whether decoding changed the string: if it did, urldecode again and repeat; if it did not, the string is not a valid URI.

You can make this loop simpler if you know which schemes to expect.
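The colon test can be sketched in Java as follows. The names are hypothetical, Java 10+ is assumed, and the input is assumed to be a complete URI with a scheme, as stated above.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class ColonCheck {
    // Decodes until a ':' appears; empty if decoding stops changing the string first.
    static Optional<String> decodeUntilScheme(String input) {
        String current = input;
        while (!current.contains(":")) {
            String decoded;
            try {
                decoded = URLDecoder.decode(current, StandardCharsets.UTF_8);
            } catch (IllegalArgumentException e) {
                return Optional.empty(); // malformed escape, treat as invalid
            }
            if (decoded.equals(current)) {
                return Optional.empty(); // no colon and nothing left to decode
            }
            current = decoded;
        }
        return Optional.of(current);
    }

    public static void main(String[] args) {
        System.out.println(decodeUntilScheme("http%3A%2F%2Fexample.com"));       // Optional[http://example.com]
        System.out.println(decodeUntilScheme("http%253A%252F%252Fexample.com")); // Optional[http://example.com]
        System.out.println(decodeUntilScheme("example"));                        // Optional.empty
    }
}
```

If the expected schemes are known, the `contains(":")` test could be replaced by a check such as `startsWith("http:")` or `startsWith("https:")`.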
