Long ago, I wrote a style 'normalizer' program to scan the ASP/HTML code of a big pile of classic ASP pages (most of which were originally generated from MS Word documents, so naturally they were littered with superfluous stylesheets and massive one-off styles). The style normalizer generated a minimal set of stylesheets and styles plus a new 'sanitized' ASP/HTML document, such that the sanitized document produced exactly the same rendered output as the original (verified with screenshot image comparisons).
Every now and then, I run across a need for a program like this, and am toying with the idea of writing one for commercial release.
My googling has not turned up anything exactly like this (the HTML::Normalize Perl module and the HTML Tidy project just seem to clean up tags).
So, my questions are:
1. Is there such a tool already, commercial or otherwise?
2. If not, does anybody really need it?
3. If so, what features would make it truly worthwhile?
Re #3, for example: collecting a base stylesheet for a set of pages, or adjusting all pages to use a given base stylesheet; preserving classic ASP commands; following #includes; preserving ASP.NET embedded scripts; and so on. The more specific and numerous the suggestions, the better.
Example:
Old HTML with embedded (inline) styles:
<html><head>
<title>title</title>
<style type='text/css'>
.cls1 { font-family: arial; font-size: 10px; font-weight: bold; }
</style>
</head>
<body>
<% somefunction() %>
<div class='cls1' style='font-size:10px;'>test div</div>
</body>
</html>
New HTML:
<html><head>
<title>title</title>
<style type='text/css'>
.cls1 { font-family: arial; font-size: 10px; font-weight: bold; }
</style>
</head>
<body>
<% somefunction() %>
<div class='cls1'>test div</div>
</body>
</html>
Note that the inline style on the div is gone, since it was redundant with the class cls1.
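To make the redundancy check concrete, here is a minimal Python sketch, not the original normalizer. It assumes a single class rule applies to the element and ignores the cascade, specificity, shorthand properties and !important. The helper names parse_declarations and strip_redundant are made up for illustration.

```python
def parse_declarations(css_text):
    """Parse 'prop: value; ...' into a dict (property names lowercased)."""
    declarations = {}
    for part in css_text.split(";"):
        if ":" not in part:
            continue  # skip empty fragments such as the one after a trailing ';'
        prop, value = part.split(":", 1)
        # Collapse whitespace so 'font-size:10px' and 'font-size: 10px' compare equal.
        declarations[prop.strip().lower()] = " ".join(value.split())
    return declarations


def strip_redundant(inline_style, class_rule):
    """Drop inline declarations that the class rule already sets to the same value."""
    inline = parse_declarations(inline_style)
    rule = parse_declarations(class_rule)
    kept = {p: v for p, v in inline.items() if rule.get(p) != v}
    return "; ".join(f"{p}: {v}" for p, v in kept.items())


cls1 = "font-family: arial; font-size: 10px; font-weight: bold;"
print(repr(strip_redundant("font-size:10px;", cls1)))       # ''
print(strip_redundant("font-size:10px; color:red", cls1))   # color: red
```

An empty result means the whole style attribute can be dropped, as in the div above. A real tool would also have to resolve which rules actually apply to each element before deciding that a declaration is redundant.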
EDIT: removed the term 'sanitizer', since I'm not focused on XSS attacks or filtering input in comments, merely on consolidating a lot of ad-hoc styles and random CSS classes into a minimal, coherent set of stylesheets.
Well, I can't say definitively that this "works" for everything described, but Tidy does a bit more than clean up tags.
See the HTML Tidy configuration options, especially those relating to Microsoft Word (like word-2000).
If you want to know if you've done a reasonable job, you should try these tests (using something like Tidy you'll probably find you haven't done a reasonable job).
Some options:
- HTML Purifier in PHP
- lxml.html.clean in Python
- feedparser has an aggressive cleaner in Python
- LiveJournal code in Perl
Anything that uses regular expressions and doesn't parse the markup would be suspect in my mind (and just too complicated to implement).
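As a rough illustration of parsing instead of pattern matching, the sketch below uses Python's standard-library html.parser to find elements that carry an inline style. The StyleCollector class and the sample page are made up for illustration. The ASP block happens to pass through as plain text in this sample, but a real tool would need to protect such blocks explicitly.

```python
from html.parser import HTMLParser


class StyleCollector(HTMLParser):
    """Record (tag, class, style) for every element with an inline style."""

    def __init__(self):
        super().__init__()
        self.inline_styles = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("style"):
            self.inline_styles.append(
                (tag, attributes.get("class"), attributes["style"])
            )


page = """<html><body>
<% somefunction() %>
<div class='cls1' style='font-size:10px;'>test div</div>
<p title="style='fake'">no inline style here</p>
</body></html>"""

collector = StyleCollector()
collector.feed(page)
collector.close()
print(collector.inline_styles)
# [('div', 'cls1', 'font-size:10px;')]
```

The second element shows why a pattern such as style='...' is fragile: the text appears inside a title attribute value, which the parser correctly does not report as an inline style.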
This is an old question, but some people might still find this useful: check out http://necolas.github.com/normalize.css/. It works well!
Don't forget Beautiful Soup.
Source: https://stackoverflow.com/questions/299620/is-there-an-html-css-normalizer-that-works