Is there an html css normalizer that works? [closed]

Submitted by 喜夏-厌秋 on 2019-12-06 14:45:34

Question


Long ago, I wrote a style 'normalizer' program to scan the ASP/HTML code of a big pile of classic ASP pages (most of which were originally generated from MS Word documents, so naturally they were littered with superfluous stylesheets and massive one-off styles). The style normalizer generated a minimal set of stylesheets and styles plus a new 'sanitized' ASP/HTML document, such that the sanitized document produced exactly the same rendered output as the original (verified with screenshot image comparisons).
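To make the "minimal set of styles" idea concrete, here is a rough Python sketch of one piece of it: folding inline styles that repeat across a document into shared, generated classes. This is not the original program. The function names, the `min_uses` threshold and the `s1`, `s2`, ... class names are all assumptions made for illustration, and it only handles simple `property: value` declarations.

```python
from collections import Counter


def canonical(style):
    """Normalize a style attribute so equivalent strings compare equal."""
    decls = {}
    for part in style.split(";"):
        if ":" in part:
            prop, value = part.split(":", 1)
            # Lowercase the property name and collapse whitespace in the value.
            decls[prop.strip().lower()] = " ".join(value.split())
    # Sorting makes the comparison independent of declaration order.
    return "; ".join(f"{p}: {v}" for p, v in sorted(decls.items()))


def build_shared_classes(inline_styles, min_uses=2, prefix="s"):
    """Map each inline style used at least min_uses times to a class name."""
    counts = Counter(canonical(s) for s in inline_styles)
    counts.pop("", None)  # ignore empty or unparseable style attributes
    mapping = {}
    for style, uses in counts.most_common():
        if uses >= min_uses:
            mapping[style] = f"{prefix}{len(mapping) + 1}"
    return mapping


def render_stylesheet(mapping):
    """Emit one CSS rule per generated class."""
    return "\n".join(f".{name} {{ {style}; }}" for style, name in mapping.items())


if __name__ == "__main__":
    styles = ["font-size:10px; color:red", "color: red;font-size: 10px;", "margin:0"]
    print(render_stylesheet(build_shared_classes(styles)))
```

Sorting declarations is a simplification: when a shorthand and a longhand property overlap (for example `border` and `border-top`), order affects the result, so a real tool would have to keep the original order in those cases.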

Every now and then, I run across a need for a program like this, and am toying with the idea of writing one for commercial release.

My googling has not turned up anything exactly like this (the HTML::Normalize Perl module and the HTML Tidy project just seem to clean up tags).

So, my questions are:

  1. Is there such a tool already, commercial or otherwise?
  2. If not, does anybody really need it?
  3. If so, what features would make it truly worthwhile?

Regarding #3, examples would be: collecting a base stylesheet for a set of pages, or adjusting all pages to use a given base stylesheet; preserving classic ASP commands; following #includes; preserving ASP.NET embedded scripts; and so on. The more specific and numerous the suggestions, the better.

Example:
Old HTML with embedded tags

<html><head>
<title>title</title>
<style type='text/css'>
.cls1 { font-family: arial; font-size: 10px; font-weight: bold; }
</style>
</head>
<body>
<% somefunction() %>
<div class='cls1' style='font-size:10px;'>test div</div>
</body>
</html>

New HTML

<html><head>
<title>title</title>
<style type='text/css'>
.cls1 { font-family: arial; font-size: 10px; font-weight: bold; }
</style>
</head>
<body>
<% somefunction() %>
<div class='cls1'>test div</div>
</body>
</html>

Note that the inline style on the div is gone, since it was redundant with the class cls1.
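The redundancy check in the example above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: class rules are already available as a plain dictionary, values are compared as literal strings, and selector specificity, `!important`, shorthand properties and inheritance are all ignored. The names `parse_declarations` and `prune_inline_style` are made up for this sketch.

```python
def parse_declarations(css_text):
    """Parse 'a: b; c: d' into a dict of property -> value."""
    decls = {}
    for part in css_text.split(";"):
        if ":" not in part:
            continue
        prop, value = part.split(":", 1)
        decls[prop.strip().lower()] = value.strip()
    return decls


def prune_inline_style(inline_style, classes, class_rules):
    """Drop inline declarations that the element's classes already set.

    classes     -- class names on the element, in order
    class_rules -- dict of class name -> dict of property -> value
    """
    from_classes = {}
    for name in classes:
        # Later classes override earlier ones in this simplified model.
        from_classes.update(class_rules.get(name, {}))
    kept = {
        prop: value
        for prop, value in parse_declarations(inline_style).items()
        if from_classes.get(prop) != value
    }
    return "; ".join(f"{prop}: {value}" for prop, value in kept.items())


if __name__ == "__main__":
    rules = {
        "cls1": parse_declarations(
            "font-family: arial; font-size: 10px; font-weight: bold;"
        )
    }
    # The inline style from the example is fully redundant with .cls1.
    print(repr(prune_inline_style("font-size:10px;", ["cls1"], rules)))
```

When the function returns an empty string, the `style` attribute can be removed entirely, which is what happens to the div in the example.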

EDIT: removed the term 'sanitizer' since I'm not focused on XSS attacks or filtering input in comments, merely on consolidating a lot of ad-hoc styles and random CSS classes into a minimal, coherent set of stylesheets.


Answer 1:


Well, I can't say definitively that this "works" for everything described, but Tidy does a bit more than clean up tags.

See the HTML Tidy Configuration Options, especially those relating to Microsoft Word (such as word-2000).
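As a starting point, Tidy can be driven from a small Python wrapper like the one below. It assumes a `tidy` executable is on the PATH. The `word-2000` option comes from the answer; `clean` and `drop-proprietary-attributes` are other documented Tidy options picked here as a guess at what suits Word-generated markup, and the file names are placeholders. Try it on copies of the pages first, since how well it copes with embedded ASP blocks needs to be checked case by case.

```python
import subprocess


def build_tidy_command(src, dest):
    """Build the argument list for a Tidy run aimed at Word-generated HTML."""
    return [
        "tidy",
        "-quiet",                                # suppress non-essential output
        "--word-2000", "yes",                    # strip Word 2000 specific markup
        "--clean", "yes",                        # replace legacy presentational tags with CSS
        "--drop-proprietary-attributes", "yes",  # remove vendor-specific attributes
        "-output", dest,
        src,
    ]


def run_tidy(src, dest):
    """Run Tidy and return (exit code, diagnostics)."""
    # Tidy exits with 1 for warnings and 2 for errors, so check=True is not used.
    result = subprocess.run(
        build_tidy_command(src, dest), capture_output=True, text=True
    )
    return result.returncode, result.stderr


if __name__ == "__main__":
    code, messages = run_tidy("page.html", "page.tidy.html")
    print(code)
    print(messages)
```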




Answer 2:


If you want to know if you've done a reasonable job, you should try these tests (using something like Tidy you'll probably find you haven't done a reasonable job).

Some options:

  • HTML Purifier in PHP
  • lxml.html.clean in Python
  • feedparser has an aggressive cleaner in Python
  • LiveJournal code in Perl

Anything that uses regular expressions and doesn't parse the markup would be suspect in my mind (and just too complicated to implement).
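To illustrate the parse-rather-than-regex point, here is a minimal sketch using Python's standard-library `html.parser` to collect inline `style` attributes. The class and function names are invented for this example. Classic ASP blocks (`<% ... %>`) are not given any special treatment here and would need their own handling before the output could be trusted on real pages.

```python
from html.parser import HTMLParser


class InlineStyleCollector(HTMLParser):
    """Record (tag, style) pairs for every element with an inline style."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        for name, value in attrs:
            if name == "style" and value:
                self.found.append((tag, value))


def collect_inline_styles(markup):
    collector = InlineStyleCollector()
    collector.feed(markup)
    collector.close()
    return collector.found


if __name__ == "__main__":
    sample = '<p title="a > b" style="color: red">x</p>'
    print(collect_inline_styles(sample))
```

The `title="a > b"` attribute in the sample is the kind of input that tends to trip up a naive pattern such as `<[^>]*>`, while a real parser reads the quoted value correctly.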




Answer 3:


Old question but some people might still find this useful. Check out http://necolas.github.com/normalize.css/. It works well!




Answer 4:


Don't forget Beautiful Soup:

How do I fix wrongly nested / unclosed HTML tags?



Source: https://stackoverflow.com/questions/299620/is-there-an-html-css-normalizer-that-works
