Generating ZIP files with PHP + Apache on-the-fly in high speed?

Backend · Open · 5 answers · 1562 views

日久生厌 2020-12-15 10:34

To quote some famous words:

“Programmers… often take refuge in an understandable, but disastrous, inclination towards complexity and ingenuity in their […]”

5 Answers
  • 2020-12-15 11:18

    I have a download page and wrote a zip class that is very similar to your ideas. My downloads are very big files that can't be zipped properly with the zip classes out there.

    I had similar ideas to yours. Giving up compression is a very good approach: not only do you need fewer CPU resources, you also save memory, because you don't have to touch the input files and can pass them through. You can also calculate everything, like the zip headers and the final file size, very easily, and you can jump to any position and generate from that point to support resume.

    I go even further: I generate one checksum from all the input files' CRCs and use it as an ETag for the generated file to support caching, and as part of the filename. If you have already downloaded the generated zip file, the browser gets it from the local cache instead of the server. You can also adjust the download rate (for example 300 KB/s), add zip comments, and choose which files are added and which are not (for example thumbs.db).

    But there's one problem that you can't completely overcome with the zip format: the generation of the CRC values. Even if you use hash_file() to avoid the memory problem, or hash_update() to generate the CRC incrementally, it uses too much CPU. Not much for one person, but not recommended for professional use. I solved this with an extra table of CRC values that I generate with a separate script, and I pass these CRC values to the zip class as a parameter. With this, the class is ultra fast, like a regular download script, as you mentioned.

    My zip class is a work in progress; you can have a look at it here: http://www.ranma.tv/zip-class.txt

    I hope it can help someone :)

    But I will discontinue this approach and reprogram my class as a tar class. With tar I don't need to generate CRC values for the files; tar only needs some checksums for the headers, that's all. And I don't need an extra MySQL table any more. I think it makes the class easier to use if you don't have to create an extra CRC table for it. It isn't that hard, because tar's file structure is simpler than zip's.
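    One nice consequence of the no-compression approach described above is that the final archive size is fully predictable before a single byte is sent, which is what makes Content-Length and resume offsets possible. A minimal sketch of that calculation (assuming plain "stored" entries with no extra fields, no data descriptors, and no Zip64, i.e. an archive under 4 GiB):

    ```php
    <?php
    // Sketch: predict the exact byte size of a ZIP made only of "stored"
    // (uncompressed) entries. Per the ZIP format: 30-byte local file header
    // plus filename, 46-byte central directory record plus filename, and a
    // 22-byte end-of-central-directory record.
    function predicted_zip_size(array $entries): int
    {
        $size = 0;
        foreach ($entries as $name => $fileSize) {
            $size += 30 + strlen($name) + $fileSize; // local header + raw data
            $size += 46 + strlen($name);             // central directory record
        }
        return $size + 22;                           // end-of-central-directory
    }

    // Hypothetical example: two files going into the archive.
    echo predicted_zip_size(['a.txt' => 100, 'dir/b.bin' => 2048]); // prints 2350
    ```

    With this number in hand, a resume request at byte N only requires figuring out which entry N falls into and regenerating from there.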

    PHP has an execution timeout for scripts. While it can be changed by the script itself, will there be any problems with removing it completely?

    If your script is safe and it closes on user abort, then you can remove it completely. But it would be safer to just renew the timeout for every file that you pass through :)
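    The "renew per file" idea might look like this (a sketch; the file list `$paths` and the 30-second budget are assumptions, not part of the original class):

    ```php
    <?php
    // Sketch: instead of disabling PHP's time limit globally, renew it
    // before each file and stop cleanly if the client disconnects.
    function pass_through(array $paths, int $chunk = 65536): void
    {
        foreach ($paths as $path) {
            set_time_limit(30);              // fresh 30 s budget per file
            $in = fopen($path, 'rb');
            while (!feof($in)) {
                echo fread($in, $chunk);     // stream in 64 KiB chunks
                flush();
                if (connection_aborted()) {  // user hit "cancel"
                    fclose($in);
                    return;
                }
            }
            fclose($in);
        }
    }
    ```

    This way a stalled or aborted download never holds the worker longer than one file's budget.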

    With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?

    Yes, that would work. I generated a checksum from the input files' CRCs and used it as an ETag and as part of the zip filename. If something changed, the user can't resume the generated zip, because the ETag and the filename changed together with the content.
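    A minimal sketch of that checksum-of-CRCs idea (the function name is hypothetical; it assumes the per-file CRCs are computed here with hash_file(), whereas the answer above precomputes them into a table):

    ```php
    <?php
    // Sketch: one cheap, stable tag over all per-file CRCs. hash_file()
    // streams from disk, so memory use stays flat even for huge files.
    function collection_etag(array $paths): string
    {
        $crcs = '';
        foreach ($paths as $path) {
            $crcs .= hash_file('crc32b', $path); // per-file CRC-32 as hex
        }
        return md5($crcs);                       // tag for the whole set
    }
    ```

    If any file in the collection changes, the tag (and with it the ETag and filename) changes, so a stale partial download can never be resumed against new content.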

    Will passing large amounts of file data through PHP not be a performance hit in itself?

    No, if you only pass it through, it will not use much more than a regular download. Maybe 0.01% more, I don't know; it's not much :) I assume that's because PHP doesn't do much with the data :)

  • 2020-12-15 11:24

    You can use ZipStream or PHPZip, which will send zipped files on the fly to the browser, divided into chunks, instead of loading the entire content in PHP and then sending the zip file.

    Both libraries are nice and useful pieces of code. A few details:

    • ZipStream works only in memory, but cannot be easily ported to PHP 4 if necessary (it uses hash_file())
    • PHPZip writes temporary files to disk (it consumes as much disk space as the biggest file to add to the zip), but can be easily adapted for PHP 4 if necessary.
  • 2020-12-15 11:26

    This may be what you need: http://pablotron.org/software/zipstream-php/

    This lib allows you to build a dynamic streaming zip file without swapping to disk.
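    For orientation, a hypothetical usage sketch (the method names follow the library's documentation of that era; verify them against the copy you download):

    ```php
    <?php
    // Hypothetical sketch, assuming zipstream-php is available on the
    // include path. Nothing is written to disk; entries are streamed out.
    require_once 'zipstream.php';

    $zip = new ZipStream('photos.zip');                   // headers go out as output begins
    $zip->add_file('readme.txt', 'photo archive');        // small entry from a string
    $zip->add_file_from_path('img/1.jpg', '/data/1.jpg'); // big entry streamed from disk
    $zip->finish();                                       // central directory + footer
    ```

    The filenames and paths above are placeholders for illustration only.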

  • 2020-12-15 11:27

    Use e.g. the PhpConcept Library Zip library.

    Resuming must be supported by your webserver, except in the case where you don't make the zip files accessible directly. If you have a PHP script as a mediator, then pay attention to sending the right headers to support resuming.
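    As a sketch of the "right headers" part: a resume request arrives as a `Range: bytes=start-end` header, and the mediator script has to turn that into absolute offsets before answering with 206 Partial Content. A minimal single-range parser (the function name is hypothetical):

    ```php
    <?php
    // Sketch: turn a "Range: bytes=..." header into inclusive byte offsets.
    // Returns [start, end] or null when the header is absent or unsupported
    // (multi-range requests are deliberately not handled here).
    function parse_byte_range(?string $header, int $size): ?array
    {
        if ($header === null || !preg_match('/^bytes=(\d*)-(\d*)$/', $header, $m)) {
            return null;
        }
        [, $start, $end] = $m;
        if ($start === '') {                      // "bytes=-N": the final N bytes
            if ($end === '') return null;
            $start = max(0, $size - (int) $end);
            $end   = $size - 1;
        } else {
            $start = (int) $start;
            $end   = ($end === '') ? $size - 1 : min((int) $end, $size - 1);
        }
        return ($start <= $end && $start < $size) ? [$start, $end] : null;
    }

    // A 206 response would then send:
    //   Content-Range: bytes {start}-{end}/{size}
    //   Content-Length: {end - start + 1}
    ```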

    The script creating the files shouldn't ever time out; just make sure the users can't select thousands of files at once. And keep something in place to remove old zip files, and watch out that some malicious user doesn't use up your disk space by requesting many different file collections.
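    The cleanup part can be as simple as a small script run from cron (a sketch; the cache directory and one-day age limit are assumptions):

    ```php
    <?php
    // Sketch: purge cached archives older than $maxAge seconds from a
    // hypothetical cache directory; returns how many were removed.
    function purge_old_zips(string $dir, int $maxAge = 86400): int
    {
        $removed = 0;
        foreach (glob($dir . '/*.zip') as $file) {
            if (filemtime($file) < time() - $maxAge) {
                unlink($file);
                $removed++;
            }
        }
        return $removed;
    }
    ```

    Combined with a per-user quota on how many collections may be requested, this keeps a malicious user from filling the disk.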

  • 2020-12-15 11:28

    You're going to have to store the generated zip file if you want them to be able to resume downloads.

    Basically, you generate the zip file and chuck it in a /tmp directory with a repeatable filename (a hash of the search filters, maybe). Then you send the correct headers to the user and echo file_get_contents() to the user.

    To support resuming, you need to check the $_SERVER['HTTP_RANGE'] value (its format is detailed here), and once you've parsed that you'll need to run something like this.

    $size = filesize($zip_file);
    
    if (isset($_SERVER['HTTP_RANGE'])) {
        // Parse a single "Range: bytes=start-end" header
        [$start, $end] = explode('-', substr($_SERVER['HTTP_RANGE'], 6), 2);
        $start = (int) $start;
        $end = ($end === '') ? $size - 1 : (int) $end;
        $new_length = $end - $start + 1; // byte ranges are inclusive
        header("HTTP/1.1 206 Partial Content");
        header("Content-Length: $new_length");
        header("Content-Range: bytes $start-$end/$size");
        echo file_get_contents($zip_file, false, null, $start, $new_length);
    } else {
        header("Accept-Ranges: bytes");
        header("Content-Length: $size");
        echo file_get_contents($zip_file);
    }
    
    

    This is very sketchy code; you'll probably need to play around with the headers and the handling of the HTTP_RANGE value a bit. You can use fopen() and fread() rather than file_get_contents() if you wish, and just fseek() to the right place.
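    That fopen()/fseek() variant might look like this (a sketch; the function name is made up). Its advantage over file_get_contents() is that the requested range never has to fit in memory at once:

    ```php
    <?php
    // Sketch: stream $length bytes starting at $start in fixed-size chunks.
    function stream_range(string $path, int $start, int $length, int $chunk = 8192): void
    {
        $fp = fopen($path, 'rb');
        fseek($fp, $start);                   // jump to the resume offset
        while ($length > 0 && !feof($fp)) {
            $data = fread($fp, min($chunk, $length));
            if ($data === false || $data === '') {
                break;
            }
            echo $data;
            $length -= strlen($data);
            flush();                          // push the chunk to the client
        }
        fclose($fp);
    }
    ```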

    Now to your questions

    • PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?

    You can remove it if you want to. However, if something goes pear-shaped and your code gets stuck in an infinite loop, that can lead to interesting problems, should that infinite loop be logging an error somewhere that you don't notice until a rather grumpy sysadmin wonders why their server ran out of hard disk space ;)

    • With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?

    Caching the file to the hard disk means you won't have this problem.

    • Will passing large amounts of file data through PHP not be a performance hit in itself?

    Yes, it won't be as fast as a regular download from the webserver, but it shouldn't be too slow.
