Quickly Extracting the Host from a URL, IPv6 Included

0x01 Background

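IPv4 addresses are close to exhaustion and IPv6 has arrived, so many companies are presumably adapting to IPv6 or have already done so. I recently ran into a URL-parsing problem during development that had to account for URLs containing IPv6 addresses, so here is a brief summary. The cases fall into the following categories: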

http://host/
http://host:port/
http://ipv4
http://ipv4:port/
http://ipv6
http://ipv6:port

0x02 The One-Sentence Solution

1. If the Python version is 2.7 or later and the IPv6 URL conforms to RFC 3986, simply parse it with urlparse.

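2. If the Python version is lower than 2.7, or the URL containing the IPv6 address does not conform to the specification, urlparse cannot be used to obtain the hostname and a custom function is needed. For reference: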

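import socket
from urlparse import urlparse  # Python 2 module name


def is_ipv6(ip):
    """Return True if ip is a valid IPv6 literal (no brackets, no port)."""
    try:
        socket.inet_pton(socket.AF_INET6, ip)
    except socket.error:
        return False
    return True


def extract_host_from_url(url):
    """Return the host part of url with any trailing ":port" removed."""
    host = urlparse(url).netloc
    print 'netloc = ', host
    # A netloc that is itself a valid IPv6 literal is returned unchanged.
    if not is_ipv6(host):
        last_colon_index = host.rfind(':')
        print 'last_colon_index is ', last_colon_index
        if last_colon_index == -1:
            # No colon at all: a hostname or IPv4 address without a port.
            return host
        # Everything after the last colon is treated as the port and dropped.
        host = host[:last_colon_index]
    print 'extract host from url is : ', host
    return host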
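For readers on Python 3, the same fallback idea can be sketched as follows. This is only an illustrative port, not part of the original code: it assumes `urllib.parse.urlsplit` in place of the Python 2 `urlparse` module, drops the debug prints, and the name `extract_host` is made up for this example. Like the function above, it expects unbracketed input.

```python
import socket
from urllib.parse import urlsplit


def is_ipv6(text):
    """Return True if text is a valid IPv6 literal (no brackets, no port)."""
    try:
        socket.inet_pton(socket.AF_INET6, text)
    except OSError:
        return False
    return True


def extract_host(url):
    """Hypothetical Python 3 port of extract_host_from_url (same logic)."""
    netloc = urlsplit(url).netloc
    if is_ipv6(netloc):
        # The whole netloc is an IPv6 literal, so there is no port to strip.
        return netloc
    # Otherwise treat whatever follows the last colon as the port.
    host, sep, _port = netloc.rpartition(':')
    return host if sep else netloc


for sample in ('http://host/', 'http://host:8080/',
               'http://192.0.2.1:8080/', 'http://2001:db8::1/'):
    print(sample, '->', extract_host(sample))
```

A fully written-out address followed by a port, such as 2001:db8:0:0:0:0:0:1:8080, has nine groups and is therefore not a valid IPv6 literal, so the port is stripped as intended.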

0x03 URLs That Conform to RFC 3986

What is RFC 3986

The notation in that case is to encode the IPv6 IP number in square brackets:
http://[2001:db8:1f70::999:de8:7648:6e8]:100/
That's RFC 3986, section 3.2.2: Host

A host identified by an Internet Protocol literal address, version 6 [RFC3513] or later, is distinguished by enclosing the IP literal within square brackets ("[" and "]"). This is the only place where square bracket characters are allowed in the URI syntax. In anticipation of future, as-yet-undefined IP literal address formats, an implementation may use an optional version flag to indicate such a format explicitly rather than rely on heuristic determination.

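According to the RFC, an IPv6 address in a URL must be enclosed in square brackets. A parser should therefore treat the brackets as the distinguishing feature and refuse to parse a URL that does not meet this requirement.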
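The "brackets or nothing" rule can be sketched in Python 3 roughly as below. This is a minimal illustration under assumptions, not the way urlparse itself is implemented: the helper name `bracketed_ipv6_host` is hypothetical, and the standard `ipaddress` module is used to validate the text between the brackets.

```python
import ipaddress
from urllib.parse import urlsplit


def bracketed_ipv6_host(url):
    """Return the IPv6 literal of a URL whose host is written as [addr], else None."""
    try:
        netloc = urlsplit(url).netloc
    except ValueError:
        # urlsplit itself raises for malformed bracket usage.
        return None
    hostport = netloc.rpartition('@')[2]  # drop any "user:password@" prefix
    if not hostport.startswith('['):
        return None  # not a bracketed literal, so do not treat it as IPv6
    end = hostport.find(']')
    if end == -1:
        return None
    literal = hostport[1:end]
    try:
        ipaddress.IPv6Address(literal)
    except ValueError:
        return None  # brackets present, but the content is not IPv6
    return literal


print(bracketed_ipv6_host('http://[2001:db8:1f70::999:de8:7648:6e8]:100/'))
print(bracketed_ipv6_host('http://2001:db8::1:80/'))
```

The first call returns the address without brackets; the second returns None because the address is not bracketed.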

Implementation

urlparse gained support for parsing IPv6 hosts in Python 2.7. The implementation can be found here:

https://github.com/enthought/Python-2.7.3/blob/master/Lib/urlparse.py

Verification code:

# coding: utf-8

from urlparse import urlparse


def test():
    url1 = 'http://www.Python.org/doc/#'
    url2 = 'http://[fe80::240:63ff:fede:3c19]:8080'
    url3 = 'http://[2001:db8:1f70::999:de8:7648:6e8]:100/'
    url4 = 'http://[2001:db8:1f70::999:de8:7648:6e8]'
    urls = [url1, url2, url3, url4]
    for url in urls:
        up = urlparse(url)
        print up.hostname, up.port


if __name__ == '__main__':
    test()

Output

www.python.org None
fe80::240:63ff:fede:3c19 8080
2001:db8:1f70::999:de8:7648:6e8 100
2001:db8:1f70::999:de8:7648:6e8 None
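In Python 3 the `urlparse` module became `urllib.parse`. Assuming Python 3, the same check can be sketched as follows, using the same four URLs; the `results` list is introduced here only for illustration.

```python
from urllib.parse import urlparse

urls = [
    'http://www.Python.org/doc/#',
    'http://[fe80::240:63ff:fede:3c19]:8080',
    'http://[2001:db8:1f70::999:de8:7648:6e8]:100/',
    'http://[2001:db8:1f70::999:de8:7648:6e8]',
]

# hostname is lowercased and stripped of brackets; port is an int or None.
results = [(urlparse(u).hostname, urlparse(u).port) for u in urls]
for host, port in results:
    print(host, port)
```

The printed host and port pairs match the Python 2 output above.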

0x04 URLs That Do Not Conform to RFC 3986

IPv6 notation has some special rules:

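Leading zeros in each group can be omitted, and if the next leading digit is still 0 it can be omitted as well. For example, the following IPv6 addresses are all equivalent: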

2001:0DB8:02de:0000:0000:0000:0000:0e13
2001:DB8:2de:0000:0000:0000:0000:e13
2001:DB8:2de:000:000:000:000:e13
2001:DB8:2de:00:00:00:00:e13
2001:DB8:2de:0:0:0:0:e13
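One way to check this equivalence is to parse each spelling and compare the results. The sketch below assumes Python 3 and its standard `ipaddress` module; the variable names are illustrative.

```python
import ipaddress

forms = [
    '2001:0DB8:02de:0000:0000:0000:0000:0e13',
    '2001:DB8:2de:0000:0000:0000:0000:e13',
    '2001:DB8:2de:000:000:000:000:e13',
    '2001:DB8:2de:00:00:00:00:e13',
    '2001:DB8:2de:0:0:0:0:e13',
]

# Parsing normalizes case and leading zeros, so equal addresses collapse in a set.
unique_addresses = {ipaddress.IPv6Address(form) for form in forms}
print(len(unique_addresses))  # 1: all five spellings are the same address
```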

A double colon "::" can stand for one or more consecutive groups of zeros, but it may appear only once:

2001:DB8:2de:0:0:0:0:e13
2001:DB8:2de::e13
2001:0DB8:0000:0000:0000:0000:1428:57ab
2001:0DB8:0000:0000:0000::1428:57ab
2001:0DB8:0:0:0:0:1428:57ab
2001:0DB8:0::0:1428:57ab
2001:0DB8::1428:57ab
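The compression rule can be explored with Python 3's standard `ipaddress` module, as in this small sketch. The helper `is_valid_ipv6` is a hypothetical name used only here.

```python
import ipaddress

addr = ipaddress.IPv6Address('2001:0DB8:0000:0000:0000:0000:1428:57ab')
print(addr.compressed)  # 2001:db8::1428:57ab
print(addr.exploded)    # 2001:0db8:0000:0000:0000:0000:1428:57ab


def is_valid_ipv6(text):
    """Return True if text parses as an IPv6 address."""
    try:
        ipaddress.IPv6Address(text)
    except ValueError:
        return False
    return True


# Partially compressed forms are accepted; a second "::" is rejected.
print(is_valid_ipv6('2001:0DB8:0::0:1428:57ab'))  # True
print(is_valid_ipv6('2001:db8::1428::57ab'))      # False
```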

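This is where the problem arises. If the IPv6 address is written in abbreviated form, for example 2001:0DB8::1428:57ab, then appending a port gives 2001:0DB8::1428:57ab:443, which is still a valid IPv6 expression: the double colon can stand for the original four groups of zeros, or for only three groups of zeros, in which case the trailing port 443 is read as part of the IPv6 address.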
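The ambiguity is easy to see by expanding both strings. The sketch below assumes Python 3 and the standard `ipaddress` module; the variable names are illustrative.

```python
import ipaddress

without_port = ipaddress.IPv6Address('2001:0DB8::1428:57ab')
with_port = ipaddress.IPv6Address('2001:0DB8::1428:57ab:443')  # parses without error

# "::" expands to four zero groups in the first case and three in the second,
# so "443" silently becomes the last group of a different address.
print(without_port.exploded)  # 2001:0db8:0000:0000:0000:0000:1428:57ab
print(with_port.exploded)     # 2001:0db8:0000:0000:0000:1428:57ab:0443
```

Because both strings are valid addresses, a validity check alone cannot tell whether the trailing 443 was meant as a port.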