Downloading files in chunks in python?

Submitted by 丶灬走出姿态 on 2021-02-10 21:42:27

Question


I am writing a simple synchronous download manager which downloads a video file in 10 sections. I use requests to read Content-Length from the headers, split the file into 10 byte ranges, download each range, and then merge them to form the complete video. The code below is supposed to work this way, but the merged file only plays for a few seconds and after that it is corrupted. What is wrong in my code?

import requests
import os

def intervals(parts, duration):
    part_duration = duration // parts
    return [(i * part_duration, (i + 1) * part_duration) for i in range(parts)]

home = os.path.expanduser("~")
if not os.path.exists(home+'/Desktop/temp'):
    os.makedirs(home+'/Desktop/temp')

PATH = home+"/Desktop/temp/tmp.mp4"

example_file_url = "https://file-examples-com.github.io/uploads/2017/04/file_example_MP4_1280_10MG.mp4"


req = requests.head(example_file_url)

size = int(req.headers['Content-Length'])

content_section = 10

section_intervals = intervals(content_section,size)


with  open(PATH, "wb") as file:
    for i,(start,end) in enumerate(section_intervals):
        headers = {"Range": "bytes="+str(start)+"-"+str(end)}
        print(headers)
        r = requests.get(example_file_url, headers=headers)
        file.write(r.content)

Answer 1:


The problem

Your ranges are wrong because a Range header gives the first and the last byte offset, both inclusive. For example, bytes=0-10 means 11 bytes, from 0 to 10 (unlike how slices work in Python), so bytes=0-10 and bytes=10-20 overlap by one byte. You would need bytes=0-9 followed by bytes=10-19 instead.

See the example in this documentation:

header requesting the first 1024 bytes ... Range: bytes=0-1023

(whereas [0:1023] in a Python slice would be length 1023).
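
To make the off-by-one concrete, here is a minimal sketch of a helper that turns a Python-style half-open interval into a Range header value. The name range_header is hypothetical (it is not part of requests); it only illustrates the conversion.

```python
def range_header(start, stop):
    """Build a Range header value for the half-open byte interval [start, stop).

    HTTP byte ranges are inclusive at both ends, so the last offset is stop - 1.
    (Hypothetical helper for illustration; not part of requests.)
    """
    if stop <= start:
        raise ValueError("stop must be greater than start")
    return "bytes={}-{}".format(start, stop - 1)


print(range_header(0, 1024))     # bytes=0-1023
print(range_header(1024, 2048))  # bytes=1024-2047
```

Used this way, consecutive intervals such as (0, 1024) and (1024, 2048) produce headers that neither overlap nor leave a gap.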

Where you say that it "works for seconds and after that gets corrupted", I assume that you mean that it is valid for the first few seconds of decoded MP4 output. The point where it breaks will be the end of the first downloaded part, where the final byte of the first part is duplicated at the start of the second part.

Another problem is that your total length is wrong: the integer division by parts discards the remainder, so when you multiply back up the last range stops short of the end of the file and the final few bytes are never requested.
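
The arithmetic behind both problems can be checked without touching the network. This sketch reuses the question's interval logic (renamed original_intervals here) and the size of 9840497 bytes reported for the example file. It assumes the server returns exactly the inclusive range it is asked for.

```python
def original_intervals(parts, duration):
    # Same logic as the intervals function in the question.
    part_duration = duration // parts
    return [(i * part_duration, (i + 1) * part_duration) for i in range(parts)]


size = 9840497
ranges = original_intervals(10, size)

# Each (first, last) pair is inclusive in a Range header, so it yields last - first + 1 bytes.
bytes_written = sum(last - first + 1 for first, last in ranges)

# Bytes requested twice: the overlap between each range and the one after it.
duplicated = sum(prev_last - next_first + 1
                 for (_, prev_last), (next_first, _) in zip(ranges, ranges[1:]))

# Bytes never requested: everything after the last offset of the final range.
missing = (size - 1) - ranges[-1][1]

print(bytes_written)  # 9840500
print(duplicated)     # 9
print(missing)        # 6
```

So the merged file ends up 3 bytes longer than the original while still lacking its last 6 bytes.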

The fix

Change your intervals function to this, and it works:

import math

def intervals(parts, duration):
    part_duration = math.ceil(duration / parts)
    return [(start, min(start + part_duration - 1, duration - 1)) 
             for start in range(0, duration, part_duration)]
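
As a sanity check, here is a small sketch that tests whether a list of inclusive ranges tiles the file exactly, with no gaps and no overlaps. The name covers_exactly is my own, and the size is the one reported for the example file.

```python
import math


def intervals(parts, duration):
    part_duration = math.ceil(duration / parts)
    return [(start, min(start + part_duration - 1, duration - 1))
            for start in range(0, duration, part_duration)]


def covers_exactly(ranges, size):
    """Return True if the inclusive (first, last) ranges cover bytes 0..size-1
    in order, with no gaps and no overlaps. (Hypothetical helper.)"""
    expected_first = 0
    for first, last in ranges:
        if first != expected_first or last < first:
            return False
        expected_first = last + 1
    return expected_first == size


size = 9840497
print(covers_exactly(intervals(10, size), size))  # True
print(covers_exactly([(0, 10), (10, 20)], 21))    # False (byte 10 is requested twice)
```

Note that with the ceiling division the function can return fewer than parts ranges for some sizes (for example 9 bytes in 4 parts gives 3 ranges), but the ranges still cover the whole file.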

Inspecting the ranges

Inserting print statements:

print("Size = ", size)
print(section_intervals)

now gives:

Size =  9840497
[(0, 984049), (984050, 1968099), (1968100, 2952149), (2952150, 3936199), (3936200, 4920249), (4920250, 5904299), (5904300, 6888349), (6888350, 7872399), (7872400, 8856449), (8856450, 9840496)]

whereas using your original intervals function, it gives:

Size =  9840497
[(0, 984049), (984049, 1968098), (1968098, 2952147), (2952147, 3936196), (3936196, 4920245), (4920245, 5904294), (5904294, 6888343), (6888343, 7872392), (7872392, 8856441), (8856441, 9840490)]

Note the overlapping ranges and the bytes missing from the end.

Verifying output using md5sum

We can verify the download at the end by calculating a checksum. In this example, I use md5sum from the Linux command line (cksum would also work, as there is no need for a cryptographic checksum for this purpose).

I called the output myoutput.

$ md5sum myoutput
10c918b1d01aea85864ee65d9e0c2305  myoutput

Now I also download a copy directly with wget <url> and see that it has the same checksum.

$ wget https://file-examples-com.github.io/uploads/2017/04/file_example_MP4_1280_10MG.mp4
--2020-07-21 08:26:52--  https://file-examples-com.github.io/uploads/2017/04/file_example_MP4_1280_10MG.mp4

$ md5sum file_example_MP4_1280_10MG.mp4 
10c918b1d01aea85864ee65d9e0c2305  file_example_MP4_1280_10MG.mp4
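
If md5sum is not available (for example on Windows), the same comparison can be sketched in Python with hashlib from the standard library. The function name file_md5 and the 1 MiB block size are my own choices; the file paths are whatever you pass on the command line.

```python
import hashlib
import sys


def file_md5(path, block_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in blocks
    so the whole file never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (end of file).
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


if __name__ == "__main__":
    # Print one "digest  path" line per argument, similar to md5sum.
    for path in sys.argv[1:]:
        print(file_md5(path) + "  " + path)
```

Saved as, say, md5check.py (a hypothetical file name), it could be run as python md5check.py myoutput file_example_MP4_1280_10MG.mp4, and the two digests should match if the merge is correct.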


Source: https://stackoverflow.com/questions/63008887/downloading-files-in-chunks-in-python
