Can't open Unicode URL with Python


Using Python 2.5.2 on Debian Linux, I'm trying to get the content from a Spanish URL that contains the Spanish character 'í':

import urllib
url = u


                      
5 answers
                
说谎, 2020-12-09 20:28
                                                                       
Per the applicable standard, RFC 1738, URLs can only contain ASCII characters. Good explanation here, and I quote:


  "...Only alphanumerics [0-9a-zA-Z],
  the special characters "$-_.+!*'(),"
  [not including the quotes - ed], and
  reserved characters used for their
  reserved purposes may be used
  unencoded within a URL."


As the URLs I've given explain, this probably means you'll have to replace that "lowercase i with acute accent" with its percent-encoded form: '%ED' if the server expects Latin-1, or '%C3%AD' if it expects UTF-8.
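
For illustration, here is a minimal sketch of that replacement. It assumes Python 3, where the function lives in urllib.parse (the question uses Python 2.5, where the equivalent is urllib.quote), and the path /índice.html is a made-up example. The URL itself does not say which encoding the server expects, so both variants are shown.

```python
from urllib.parse import quote

# Hypothetical path containing the character from the question.
path = "/índice.html"

# quote() percent-encodes using UTF-8 unless told otherwise.
utf8_form = quote(path)

# Some older servers expect Latin-1 (ISO-8859-1) bytes instead.
latin1_form = quote(path, encoding="latin-1")

print(utf8_form)
print(latin1_form)
```

If one form returns a 404, trying the other is a reasonable next step.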
                                                                        
                                                        
            
            
              
                
名媛妹妹, 2020-12-09 20:34
                                                                       
This works for me:    

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The source encoding must be declared on line 1 or 2, see: http://www.python.org/dev/peps/pep-0263/

import urllib
url = u'http://example.com/índice.html'
content = urllib.urlopen(url.encode("UTF-8")).read()

                                                                        
                                                        
            
            
              
                
野的像风, 2020-12-09 20:35
                                                                       
It works for me. Make sure you're using a fairly recent version of Python, and your file encoding is correct.
Here's my code:

# -*- coding: utf-8 -*-
import urllib
url = u'http://mydomain.es/índice.html'
url = url.encode('utf-8')
content = urllib.urlopen(url).read()


(mydomain.es does not exist, so the DNS lookup fails, but there are no Unicode issues up to that point.)
                                                                        
                                                        
            
            
              
                
半阙折子戏, 2020-12-09 20:51
                                                                       
Encoding the URL as UTF-8 should have worked. I wonder if your source file is properly encoded, and whether the interpreter knows it. If your Python source file is saved as UTF-8, for example, then you should have

# coding=UTF-8


as the first or second line.

import urllib
url = u'http://mydomain.es/índice.html'
content = urllib.urlopen(url.encode('utf-8')).read()


works for me.

Edit: also, be aware that Unicode text in an interactive Python session (whether through IDLE or a console) is fraught with encoding-related difficulty. In those cases, you should use Unicode escape sequences in your string literals (like \u00ED in your case).
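
As a rough sketch of that advice, assuming Python 3 (where every str is Unicode; in Python 2 the literal would need the u prefix) and the same placeholder host as above:

```python
# The escape \u00ed stands for 'í', so the source stays pure ASCII and
# does not depend on the console or file encoding.
url = "http://mydomain.es/\u00edndice.html"

# Encoding to UTF-8 turns that single character into the two bytes C3 AD.
encoded = url.encode("utf-8")

print(repr(encoded))
```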
                                                                        
                                                        
            
            
              
                
旧时难觅i, 2020-12-09 20:55
                                                                       
I'm having a similar problem right now. I'm trying to download images. I retrieve the URLs from the server in a JSON file. Some of the image URLs contain non-ASCII characters. This throws an error:

import os
import urllib.request

for image in product["images"]:
    filename = os.path.basename(image)
    filepath = product_path + "/" + filename
    urllib.request.urlretrieve(image, filepath)  # error!



  UnicodeEncodeError: 'ascii' codec can't encode character '\xc7' in position ...




I've tried using .encode("UTF-8"), but can't say it helped:

# coding=UTF-8
import urllib.request
url = u"http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
url = url.encode("UTF-8")  # url is now bytes, which urlretrieve() does not accept
urllib.request.urlretrieve(url, r"D:\image-1.jpg")


This just throws another error:


  TypeError: cannot use a string pattern on a bytes-like object




Then I gave urllib.parse.quote(url) a go:

import urllib.parse
import urllib.request
url = "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
url = urllib.parse.quote(url)  # quotes the whole URL, including the ":" after the scheme
urllib.request.urlretrieve(url, r"D:\image-1.jpg")


and again, this throws another error:


  ValueError: unknown url type: 'http%3A//example.com/wp-content/uploads/2018/09/%C4%B0MAGE-1.png'


The : in "http://..." also got escaped, and I think this is the cause of the problem.
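
To make the cause visible, here is a small sketch of how the safe argument of urllib.parse.quote changes the result. By default only "/" is left alone. Passing safe=":/" is a shortcut I'm assuming is acceptable for URLs without a query string, since "?" and "=" would still be escaped.

```python
from urllib.parse import quote

url = "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"

# Default: only "/" is treated as safe, so the ":" after the scheme is escaped.
escaped_everything = quote(url)

# Marking ":" as safe as well keeps the scheme separator intact.
escaped_keeping_colon = quote(url, safe=":/")

print(escaped_everything)
print(escaped_keeping_colon)
```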

So, I've figured out a workaround. I just quote/escape the path, not the whole URL.

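import urllib.request
import urllib.parse
url = "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
parts = urllib.parse.urlparse(url)
# Percent-encode only the path; the scheme and host stay untouched.
# Note: rebuilding the URL this way drops any query string or fragment.
url = parts.scheme + "://" + parts.netloc + urllib.parse.quote(parts.path)
urllib.request.urlretrieve(url, r"D:\image-1.jpg")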


This is what the URL looks like: "http://example.com/wp-content/uploads/2018/09/%C4%B0MAGE-1.png", and now I can download the image.
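
The same idea can be wrapped in a small helper that also keeps the query string and fragment. This is only a sketch under a few assumptions: to_ascii_url is a name I made up, the host is assumed to be ASCII already (internationalized domain names are not handled), and "%" is treated as safe so that URLs which are already percent-encoded are not encoded twice.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Percent-encode the non-ASCII parts of a URL, leaving its structure intact."""
    parts = urlsplit(url)
    # Keep "%" safe so existing percent-escapes are not encoded a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    # The scheme and host are passed through unchanged.
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


print(to_ascii_url("http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"))
```

The result can then be passed to urllib.request.urlretrieve as in the workaround above.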
                                                                        
                                                        
            
            
              
                