Question
I am trying to load yaml from files created by
entries = bag.from_sequence([{1:2}, {3:4}])
yamls = entries.map(yaml.dump)
yamls.to_textfiles(r'\*.yaml.gz')
with
yamls = bag.read_text(r'\*.yaml.gz', linedelimiter='\n\n')
but it reads the files line by line. How can I read the YAML documents from the files?
UPDATE:
- With blocksize=None, read_text reads the files line by line.
- If blocksize is set, compressed files cannot be read.
How can I overcome this? Is uncompressing the files the only option?
Answer 1:
Indeed, linedelimiter is not used in the sense you have in mind, but only for separating the larger blocks. As you say, when you compress with gzip, the file is no longer random-accessible, and blocks cannot be used at all.
It would be possible to pass the linedelimiter into the functions that turn chunks of data into lines (in dask.bag.text, if you are interested).
For now, a workaround could look like this:
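delimiter = '\n\n'  # the separator the question passed as linedelimiter
yamls = bag.read_text(r'\*.yaml.gz').map_partitions(
    # lines keep their trailing newline, so join them with '' before splitting
    lambda x: ''.join(x).split(delimiter))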
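To see what the workaround does to a single partition, here is a minimal standalone sketch that uses only the standard library. The helper names split_records and read_gzip_records are hypothetical (they are not part of dask), and the sketch assumes that each line keeps its trailing newline and that records are separated by one blank line.

```python
import gzip
import os
import tempfile


def split_records(lines, delimiter="\n\n"):
    """Re-join the lines of one partition and split them into records."""
    # Assumption: every line still ends with its newline character.
    text = "".join(lines)
    # Skip blank chunks (e.g. after a trailing delimiter) and trim
    # leading/trailing newlines from each record.
    return [chunk.strip("\n") for chunk in text.split(delimiter) if chunk.strip()]


def read_gzip_records(path, delimiter="\n\n"):
    """Read one gzip text file line by line, then regroup into records."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return split_records(f, delimiter)


# Write two small YAML-like documents separated by a blank line.
path = os.path.join(tempfile.mkdtemp(), "0.yaml.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("1: 2\n\n3: 4\n")

records = read_gzip_records(path)
print(records)
```

With dask, a helper like this could presumably be passed directly as read_text(...).map_partitions(split_records), after which each record can be parsed with yaml.safe_load. This only regroups records correctly when a whole file ends up in one partition, which is the case for gzip files read with blocksize=None.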
Source: https://stackoverflow.com/questions/49320789/why-linedelimiter-does-not-work-for-bag-read-text