HTML encoding and lxml parsing
I'm trying to finally solve some encoding issues that pop up from trying to scrape HTML with lxml. Here are three sample HTML documents that I've encountered: 1. <!DOCTYPE html> <html lang='en'> <head> <title>Unicode Chars: 은 —’</title> <meta charset='utf-8'> </head> <body></body> </html> 2. <!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR"> <head> <title>Unicode Chars: 은 —’</title> <meta http-equiv="content-type" content="text/html; charset=utf-8" /> </head> <body></body> </html> 3. <?xml version="1.0" encoding="utf-8"?> <!DOCTYPE html PUBLIC "-//W3C/