Question
Please help!
[+] What I have: A lot of blobs in every bucket. Blobs can vary in size from less than a kilobyte to many gigabytes.
[+] What I'm trying to do: I need to be able to either stream the data in those blobs (through a buffer of, say, 1024 bytes) or read them in chunks of a certain size in Python (see the sketch after this list). The point is that I don't think I can just do a bucket.get_blob(), because if a blob were a terabyte I wouldn't be able to hold it in physical memory.
[+] What I'm really trying to do: parse the information inside the blobs to identify keywords.
[+] What I've read: A lot of documentation on how to write to Google Cloud Storage in chunks and then use compose to stitch the pieces together (not helpful at all)
A lot of documentation on Java's pre-fetch functions (this needs to be in Python)
The Google Cloud APIs
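To make it concrete, the kind of thing I'm after looks roughly like the sketch below: read each blob in fixed-size chunks and scan every chunk for keywords without ever holding the whole object in memory. The bucket name, blob name, keyword list and chunk size are placeholders, and it assumes a google-cloud-storage release recent enough to provide Blob.open():

from google.cloud import storage

def scan_blob_for_keywords(bucket_name, blob_name, keywords, chunk_size=1024):
    # Placeholder names; Blob.open('rb') returns a file-like reader that
    # streams the object instead of downloading it all at once.
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    found = set()
    with blob.open('rb') as reader:
        while True:
            chunk = reader.read(chunk_size)  # at most chunk_size bytes per call
            if not chunk:
                break
            for kw in keywords:
                if kw.encode() in chunk:
                    found.add(kw)
    return found

(A keyword split across a chunk boundary would be missed by this naive scan, so a real version would keep a small overlap between consecutive chunks.)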
If anyone could point me in the right direction I would be really grateful! Thanks
Answer 1:
So one way I have found of doing this is to create a file-like object in Python and then use the Google Cloud API call .download_to_file() with that file-like object.
This, in essence, streams the data. The Python code looks something like this:
import os

def getStream(blob):
    # os.O_NONBLOCK lets another reader consume 'myStream' while it is still being written
    fd = os.open('myStream', os.O_WRONLY | os.O_CREAT | os.O_NONBLOCK)
    stream = os.fdopen(fd, 'wb')
    blob.download_to_file(stream)
The os.O_NONBLOCK flag is there so I can read the file while I'm still writing to it. I still haven't tested this with really big files, so if anyone knows a better implementation or sees a potential failure with this, please comment. Thanks!
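For the reading side, a rough sketch of what I mean (untested on really big files as well; the file name and the 1024-byte chunk size are just placeholders):

def consumeStream(path='myStream', chunk_size=1024):
    # Read the file that getStream() is writing to, in fixed-size chunks.
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # No new data yet; a real consumer would wait for more
                # or stop once it knows the download has finished.
                break
            yield chunk  # scan each chunk for keywords here

Each yielded chunk can then be scanned for keywords without the whole blob ever sitting in memory.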
Source: https://stackoverflow.com/questions/50380237/reading-really-big-blobs-without-downloading-them-in-google-cloud-streaming