Why does `linedelimiter` not work for bag.read_text?

*爱你&永不变心* submitted on 2020-01-16 16:28:27

Question


I am trying to load yaml from files created by

import yaml
from dask import bag

entries = bag.from_sequence([{1: 2}, {3: 4}])
yamls = entries.map(yaml.dump)  # one YAML document per element
yamls.to_textfiles('*.yaml.gz')

with

yamls = bag.read_text('*.yaml.gz', linedelimiter='\n\n')

but it reads the files line by line. How can I read the YAML documents back from the files?

UPDATE:

  1. With blocksize=None, read_text reads files line by line.
  2. If blocksize is set, you cannot read compressed files.

How can I overcome this? Is uncompressing the files the only option?


Answer 1:


Indeed, linedelimiter does not do what you have in mind: it is only used to find the boundaries between the larger blocks that a file is split into. As you say, a gzip-compressed file is no longer random-accessible, so blocks cannot be used at all.

It would be possible to also pass the linedelimiter into the functions that turn chunks of data into lines (see dask.bag.text, if you are interested).
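
To illustrate the idea of applying the delimiter when a chunk is turned into records, here is a minimal standard-library sketch. It is not dask's implementation; `read_delimited` is a hypothetical helper that decompresses a whole gzip file sequentially and then splits on the delimiter, which is why it needs no random access.

```python
import gzip
import os
import tempfile


def read_delimited(path, delimiter="\n\n", encoding="utf-8"):
    """Read a whole gzip file and split its text on `delimiter`.

    Hypothetical helper: the file is decompressed in one sequential
    pass, so no random access into the compressed stream is needed.
    """
    with gzip.open(path, "rt", encoding=encoding) as f:
        text = f.read()
    # Skip empty pieces, e.g. those caused by a trailing delimiter.
    return [part for part in text.split(delimiter) if part.strip()]


# Small demonstration with a temporary file holding two YAML documents.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "0.yaml.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("1: 2\n\n3: 4\n")
    records = read_delimited(path)
```

The trade-off is that each file is held in memory as a single string, so this only suits files that fit comfortably in RAM.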

For now, a workaround could look like this:

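from dask import bag
delimiter = '\n\n'  # blank line between YAML documents
yamls = bag.read_text('*.yaml.gz').map_partitions(
    lambda x: ''.join(x).split(delimiter))  # lines keep their own '\n'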
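
The per-partition step of this workaround can be tried without dask. The sketch below uses a hypothetical helper, `split_documents`, and assumes that each line still carries its trailing newline (as lines obtained by iterating over a text file do), so the lines are concatenated without an extra separator before splitting.

```python
def split_documents(lines, delimiter="\n\n"):
    """Rejoin the lines of one partition and split them into documents.

    Assumes every line still ends with its own newline character.
    """
    text = "".join(lines)
    # Skip empty pieces, e.g. those caused by a trailing delimiter.
    return [doc for doc in text.split(delimiter) if doc.strip()]


# Lines as they would be read from a file containing two YAML documents.
lines = ["1: 2\n", "\n", "3: 4\n"]
docs = split_documents(lines)
print(docs)
```

A named function like this could be passed to `map_partitions` in place of the lambda, and each resulting string could then be parsed, for example with `.map(yaml.safe_load)`.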


Source: https://stackoverflow.com/questions/49320789/why-linedelimiter-does-not-work-for-bag-read-text
