What's the best “file format” for saving complete web pages (images, etc.) in a single archive? [closed]


My favourite is the ZIP format, because:

  • It is very well suited for the purpose
  • It is well documented
  • There are a lot of implementations available for creating and reading them
  • A user can easily extract single files, change them, and put them back in the archive
  • Almost every major operating system (Windows, macOS, and most Linux distributions) has a ZIP tool built in
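
A minimal sketch of the packing step, assuming the page was already saved into a saved_page/ directory (a hypothetical layout) containing index.html and its resources:

    # Pack a saved page directory into a single ZIP archive.
    # "saved_page" and "page.zip" are hypothetical names for this sketch.
    import zipfile
    from pathlib import Path

    with zipfile.ZipFile("page.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for path in Path("saved_page").rglob("*"):
            if path.is_file():
                zf.write(path, path.relative_to("saved_page"))

Any standard ZIP tool can then pull out a single file, let you edit it, and put it back.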

The alternatives all have some flaw:

  • With MHTML, you cannot easily edit the archived files.
  • With data URIs, I don't know how difficult the implementation would be; a sketch follows this list. (With ZIP, even I could do it in PHP, 3 years ago...)
  • The option to store things as separate files just has far too many things that could go wrong and mess up your archive.
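
Since data URIs came up: a minimal sketch of building one in Python, assuming a local logo.png (a hypothetical file name):

    import base64, mimetypes

    path = "logo.png"
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    data_uri = f"data:{mime};base64,{payload}"
    # usable directly, e.g. <img src="data:image/png;base64,...">

So the encoding itself is simple; the hard part is rewriting every reference inside the HTML.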

PDFs are supported on nearly all browsers on nearly all platforms and store content and images in a single file. They can be edited with the right tools. This is almost definitely not ideal, but it's an option to consider.

It is not only a question of file format. Another crucial question is what exactly you want to store. Is it:

  1. to store the whole page as it is, with all referenced resources: images, CSS, and JavaScript?

  2. to capture the page as it was rendered at some point in time: a static snapshot of the rendered state of the web page's DOM?

Most current "save page as" functionality in browsers, whether it produces MAF or MHTML or a file+directory, attempts the first option. This is an ultimately flawed approach.

Don't forget that web pages these days are more like local applications than static documents you can easily store. Potential issues:

  1. one page is in fact several pages built dynamically by JS; user interaction is needed to get it into the desired state

  2. AJAX applications communicate with remote services, which renders them unusable for offline viewing

  3. hidden links in JavaScript code: such resources are then not part of the stored page, and even parsing the JS code may not discover them; you need to run the code

  4. even the position of basic HTML elements may be computed dynamically by JS, and it is not always possible or easy to recreate it locally

  5. you would need some sort of JS memory dump, and a way to load it, to get the page back into the state you hoped to store (see the sketch below)

And many many more issues...
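
For the second option from above, here is a rough sketch of capturing the rendered DOM after scripts have run. It assumes Selenium and a matching Chrome driver are installed; both are choices of this sketch, not part of the answer:

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://example.com/")
    # page_source returns the DOM as currently rendered, after JS has run
    with open("snapshot.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)
    driver.quit()

Even this captures only one state; it still misses lazily loaded resources and live event handlers.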

Check out the Chrome SingleFile extension. It stores a web page as a single HTML file, with images inlined using the already-mentioned data URIs. I haven't tested it much, so I cannot say how well it handles "volatile" AJAX pages.

Use a zip file.

You could always make a program/script that extracts the zip file to a temp directory and loads the index.html file in your browser. You could even use an index.ini/txt file to specify the file that should be loaded when extracting.
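
A minimal sketch of that extract-and-open script, assuming the archive keeps index.html at its root:

    import tempfile, webbrowser, zipfile
    from pathlib import Path

    def open_archive(archive_path):
        # extract to a fresh temp directory, then open in the default browser
        tmp = Path(tempfile.mkdtemp(prefix="webarchive-"))
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(tmp)
        webbrowser.open((tmp / "index.html").as_uri())

    open_archive("page.zip")  # "page.zip" is a hypothetical archive name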

Basically, you want something like the Mozilla Archive Format, but without the unnecessary RDF crap just to specify which file to load.

MHT files are good, but they usually use base64 to embed files, which makes the file size about a third bigger than it needs to be (data URIs have the same problem). You can add attachments as binary, but you'll have to do that manually with a hex editor or build a tool for it, and client support for binary parts might not be as good.
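
The overhead is easy to check: base64 encodes every 3 bytes as 4 characters, so binary payloads grow by roughly a third:

    import base64

    raw = bytes(3000)                  # 3,000 bytes of binary data
    print(len(base64.b64encode(raw)))  # 4000 -> ~33% larger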

Of course, if you want to use what browsers generate, MHT (Opera and IE at least) might be better.

I see no excuse to use anything other than a ZIP file.

Well, if browser support and ease of editing are the biggest concerns, I think you are stuck with the file+directory approach, unless you are willing to provide an editor for the single-file format and live with poor browser support.

You can create a single file by compressing the contents. You can also create a parent directory to ease handling.
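
A one-line sketch of that, assuming the parent directory is named saved_page (hypothetical):

    import shutil

    # compresses the directory's contents into a single page.zip
    shutil.make_archive("page", "zip", "saved_page")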


The problem is that HTML is bottom-up, not top-down. Look at your file name, which saved on my box as "What's the best "file format" for saving complete web pages (images, etc.) in a single archive? - Stack Overflow.html".

Just add a '|' and you have trouble doing copy-and-paste backups to a spare drive. In the end you end up chopping the file name in order to save it. Dozens, perhaps hundreds, of identical index.html or index.php files are cluttering my drives.

The partial solution is to write your own CMS and use scripts to map all relevant files into a flat-file database, then use fileName, size, mtime, and md5 to get a unique id for each file (see the sketch below). Create a flat-file index permitting 100k or 1000k records. The goal is to write once and use many times. So for a real CMS you need a unique id based on content (e.g. index8765432.html) that goes in your files_archive. Ditto for the others. Then you can non-destructively symlink from the saved original HTML to the files_archive and just recreate the file with a PHP or alternative script if need be. I don't know if it will work, as I'm at the same point you're at; maybe in a week I will know for sure.

The more useful approach is to have a top-down structure based on your business or personal wants and related tasks. So your files might be organized top-down, but external ones bottom-up, to preserve the original content. My interest is in Web 3.0 services, and the closer you get to machine-to-machine interaction, the greater the need to structure the information. Maybe it is time to rethink the idea of bundling everything into a single file: when you have hundreds of copies of main.css, why bundle, when a top-down solution might let you modify one file instead of hundreds?
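
A sketch of that unique-id idea, combining the fields the answer lists; the helper name is hypothetical:

    import hashlib, os

    def file_id(path):
        # derive an id from fileName, size, mtime, and an MD5 of the content
        st = os.stat(path)
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        return f"{os.path.basename(path)}-{st.st_size}-{int(st.st_mtime)}-{digest}"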
