A good solution is to upload the document to Google Docs and then export it as HTML. (This can be automated: the Google Drive API can convert an uploaded file to a Google Doc and export a Google Doc as HTML.)
The export already does a lot of clean-up for you; Beautiful Soup can then be used to make any further changes as needed. In my opinion it is the most powerful and elegant HTML parsing library available for Python.
This is a well-known workflow at news organizations.
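
As a rough illustration of the Beautiful Soup clean-up step, here is a minimal sketch. It assumes you already have the exported HTML as a string. The function name `clean_exported_html` and the choice of what to strip (style/script blocks, `style`/`class`/`id` attributes, and `<span>` wrappers) are my own assumptions for the example, not something the export requires, so adjust them to your needs.

```python
from bs4 import BeautifulSoup


def clean_exported_html(html: str) -> str:
    """Strip presentational markup from an exported HTML string (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop <style> and <script> blocks entirely.
    for tag in soup(["style", "script"]):
        tag.decompose()

    # Remove inline styling and class/id attributes from every remaining tag.
    for tag in soup.find_all(True):
        for attr in ("style", "class", "id"):
            tag.attrs.pop(attr, None)

    # Unwrap <span> elements: keep their text, discard the wrapper.
    for span in soup.find_all("span"):
        span.unwrap()

    return str(soup)


sample = '<p class="c1"><span style="font-weight:700">Hello</span> world</p>'
cleaned = clean_exported_html(sample)
print(cleaned)
```

Note that unwrapping every `<span>` also throws away formatting such as bold or italics if the export expresses it through inline styles, so you may want to map those styles to `<strong>`/`<em>` before stripping them.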