Can Goutte/Guzzle be forced into UTF-8 mode?

傲寒 2020-12-15 13:26

I'm scraping from a UTF-8 site, using Goutte, which internally uses Guzzle. The site declares a meta tag of UTF-8, thus:



        
3 Answers
  •  谎友^
    谎友^ (original poster)
    2020-12-15 13:51

    I seem to have been hitting two bugs here, one of which was identified by Peter's answer. The other was the way in which I am separately using the Symfony Crawler class to explore HTML snippets.

    I was doing this (to parse the HTML for a table row):

    $subCrawler = new Crawler($rowHtml);
    

    Adding HTML via the constructor, however, does not appear to give a way in which the character set can be specified, and I assume ISO-8859-1 is again the default.

    Simply using addHtmlContent gets it right; the second parameter specifies the character set, and it defaults to UTF-8 if it is not specified.

    $subCrawler = new Crawler();
    $subCrawler->addHtmlContent($rowHtml);
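
    To make the character set explicit rather than relying on the default, the second argument of addHtmlContent can be passed directly. The following is only a minimal sketch: it assumes symfony/dom-crawler is installed through Composer, and the table snippet in $rowHtml is a hypothetical stand-in for the real row HTML. It uses filterXPath so that no CSS selector package is needed.

```php
<?php
// Assumption: symfony/dom-crawler was installed with Composer,
// so the autoloader lives in ./vendor.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical table snippet containing non-ASCII UTF-8 text.
$rowHtml = '<table><tr><td>Zürich</td><td>café</td></tr></table>';

// Start with an empty crawler, then add the HTML with an explicit charset.
$subCrawler = new Crawler();
$subCrawler->addHtmlContent($rowHtml, 'UTF-8');

// Collect the text of every cell in the row.
$cells = $subCrawler->filterXPath('//td')->each(
    fn (Crawler $cell): string => $cell->text()
);

var_dump($cells);
```

    If the cell text comes back with mangled characters, the input was probably decoded with a different character set before it reached the crawler, which matches the first of the two bugs described above.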
    
