Can Goutte/Guzzle be forced into UTF-8 mode?


I'm scraping from a UTF-8 site using Goutte, which uses Guzzle internally. The site declares a UTF-8 meta tag, thus:


                      
3 answers

情深已故, 2020-12-15 13:29

The Crawler tries to detect the charset from the <meta charset> tag, but that tag is frequently missing, in which case the Crawler falls back to its default charset (ISO-8859-1). That is the source of the problem described in this thread.

When we pass content to the Crawler through its constructor, we lose the Content-Type header, which usually contains the charset.

Here's how we can handle it: 

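use Symfony\Component\DomCrawler\Crawler;

// Pass the Content-Type header along so the Crawler can read the charset from it.
$crawler = new Crawler();
$crawler->addContent(
    $response->getBody()->getContents(),
    $response->getHeaderLine('Content-Type')
);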


With this solution we use the charset from the server response instead of binding ourselves to a single charset, and we no longer need to decode every string received from the Crawler (with utf8_decode() or otherwise).
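
To make the role of the second argument concrete, here is a minimal sketch of how a charset can be pulled out of a Content-Type header value. The helper name charsetFromContentType() and its ISO-8859-1 fallback argument are assumptions made for illustration; this is not Symfony's actual implementation.

```php
<?php

/**
 * Hypothetical helper: extract the charset parameter from a Content-Type
 * header value, or return $default when none is present.
 */
function charsetFromContentType(string $contentType, string $default = 'ISO-8859-1'): string
{
    // Match "; charset=..." case-insensitively, allowing optional quotes.
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $contentType, $matches)) {
        return $matches[1];
    }

    return $default;
}

echo charsetFromContentType('text/html; charset=UTF-8'), PHP_EOL; // UTF-8
echo charsetFromContentType('text/html'), PHP_EOL;                // ISO-8859-1
```

When the header carries no charset, the fallback applies, which is the situation the thread describes.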
                                                                        
                                                        
            
            
              
                
深忆病人, 2020-12-15 13:34

The issue is actually with symfony/browser-kit and symfony/dom-crawler. BrowserKit's Client does not examine the HTML meta tags to determine the charset; it looks only at the Content-Type header. When the response body is handed over to the DomCrawler, it is treated as the default charset, ISO-8859-1. After examining the meta tags that decision should be reverted and the DOMDocument rebuilt, but that never happens.

The easy workaround is to wrap $crawler->text() with utf8_decode():

$text = utf8_decode($crawler->text());


This works if the input is UTF-8. For other encodings, something similar can probably be achieved with iconv(). However, you have to remember to do this every time you call text().
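
As a rough illustration of the iconv() idea, the sketch below reproduces the double encoding on a short string and then reverses it. The function name repairMisdecodedText() and its $assumedCharset parameter are made up for this example, and it assumes the text was wrongly read as a single-byte charset such as ISO-8859-1.

```php
<?php

/**
 * Hypothetical helper: undo the damage done when UTF-8 bytes were read as
 * $assumedCharset and then re-encoded as UTF-8.
 */
function repairMisdecodedText(string $text, string $assumedCharset = 'ISO-8859-1'): string
{
    // Converting back to the wrongly assumed charset restores the original bytes.
    $repaired = iconv('UTF-8', $assumedCharset, $text);

    // iconv() returns false on failure, e.g. when a character has no
    // equivalent in $assumedCharset; keep the input unchanged in that case.
    return $repaired === false ? $text : $repaired;
}

$original = "caf\xC3\xA9";                            // "café" as UTF-8 bytes
$garbled  = iconv('ISO-8859-1', 'UTF-8', $original); // bytes misread as ISO-8859-1

var_dump(repairMisdecodedText($garbled) === $original); // bool(true)
```

With the default argument this does the same job as utf8_decode() for this input. Passing a different charset name is the part that would need testing against the actual site.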

A more generic approach is to make the DomCrawler believe that it is dealing with UTF-8. To that end I've come up with a Guzzle plugin that overwrites (or adds) the charset in the Content-Type response header. You can find it at https://gist.github.com/pschultz/6554265. Usage is like this:

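<?php

use Goutte\Client;

// ForceCharsetPlugin is the class from the gist linked above; load it before this script runs.
$plugin = new ForceCharsetPlugin();
$plugin->setForcedCharset('utf-8');

$client = new Client();
// Attach the plugin to the underlying Guzzle client.
$client->getClient()->addSubscriber($plugin);
$crawler = $client->request('GET', $url);

echo $crawler->text();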
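
The core of such a plugin is a small header rewrite. The following is only a sketch of that rewrite in plain PHP, not the code from the gist: the function name forceCharsetInContentType() and the text/html fallback for an empty header are assumptions.

```php
<?php

/**
 * Hypothetical helper: return a Content-Type value whose charset parameter
 * is replaced by (or extended with) $charset.
 */
function forceCharsetInContentType(string $contentType, string $charset): string
{
    // Remove an existing charset parameter, if any, and keep the rest.
    $withoutCharset = trim(preg_replace('/;\s*charset\s*=\s*[^;]*/i', '', $contentType));

    // Assumption: treat a missing header as HTML.
    if ($withoutCharset === '') {
        $withoutCharset = 'text/html';
    }

    return $withoutCharset . '; charset=' . $charset;
}

echo forceCharsetInContentType('text/html; charset=ISO-8859-1', 'utf-8'), PHP_EOL; // text/html; charset=utf-8
echo forceCharsetInContentType('text/html', 'utf-8'), PHP_EOL;                     // text/html; charset=utf-8
```

Both the "overwrite" and the "add" case end up with the same header value, which is what lets the crawler pick up the forced charset.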

                                                                        
                                                        
            
            
              
                
谎友^, 2020-12-15 13:51

I seem to have been hitting two bugs here, one of which was identified by Peter's answer. The other was the way in which I am separately using the Symfony Crawler class to explore HTML snippets.

I was doing this (to parse the HTML for a table row):

$subCrawler = new Crawler($rowHtml);


Adding HTML via the constructor, however, does not appear to give a way in which the character set can be specified, and I assume ISO-8859-1 is again the default.

Simply using addHtmlContent gets it right; the second parameter specifies the character set, and it defaults to UTF-8 if it is not specified.

$subCrawler = new Crawler();
$subCrawler->addHtmlContent($rowHtml);

                                                                        
                                                        
            
            
              
                