I'm scraping a UTF-8 site using Goutte, which uses Guzzle internally. The site declares its character set as UTF-8 in a meta tag.
I seem to have hit two bugs here, one of which Peter's answer identified. The other was in how I separately use the Symfony Crawler class to explore HTML snippets.
I was doing this (to parse the HTML for a table row):
$subCrawler = new Crawler($rowHtml);
However, passing HTML to the constructor does not appear to offer any way to specify the character set, and I assume ISO-8859-1 is again the default.
Simply using addHtmlContent gets it right; its second parameter specifies the character set and defaults to UTF-8 when omitted.
$subCrawler = new Crawler();
$subCrawler->addHtmlContent($rowHtml);
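
For anyone wanting to try this in isolation, here is a minimal sketch of the same fix with the character set passed explicitly. It assumes symfony/dom-crawler is installed via Composer; the autoload path and the sample row in $rowHtml are made up for illustration and are not from my actual scraper.

```php
<?php
// Assumption: symfony/dom-crawler was installed with Composer; adjust the path to your project.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical UTF-8 snippet that carries no charset declaration of its own.
$rowHtml = '<table><tr><td>Zoë</td><td>café</td></tr></table>';

$subCrawler = new Crawler();
// The second argument is the character set. UTF-8 is already the default;
// it is spelled out here only to make the intent visible.
$subCrawler->addHtmlContent($rowHtml, 'UTF-8');

// Collect the text of every cell. filterXPath is used so that the separate
// css-selector component is not required.
$cells = $subCrawler->filterXPath('//td')->each(function (Crawler $cell) {
    return $cell->text();
});

var_dump($cells);
```

If the snippet were in a different encoding, that encoding's name would go in the second argument instead.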