Force dask to_parquet to write a single file

别等时光非礼了梦想. Submitted on 2020-06-17 10:02:19

Question


When using dask.to_parquet(df, filename), a folder named filename is created and several files are written to it, whereas pandas.to_parquet(df, filename) writes exactly one file. Can I use dask's to_parquet (without calling compute() to create a pandas df) to write just a single file? A minimal illustration of the difference is shown below.
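Sketch of the behaviour described above (output paths are made up for illustration):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(10)})
pdf.to_parquet("out_pandas.parquet")    # pandas: exactly one file

ddf = dd.from_pandas(pdf, npartitions=2)
ddf.to_parquet("out_dask.parquet")      # dask: a directory with one file per partition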


Answer 1:


Writing to a single file is very hard within a parallel system. Sorry, such an option is not offered by Dask (nor, probably, by any other parallel processing library).

You could, in theory, perform the operation yourself with a non-trivial amount of work: iterate through the partitions of your dataframe, write each one to the target file (which you keep open), and accumulate the output row-groups into the final metadata footer of the file. I would know how to go about this with fastparquet, but that library is no longer under much active development.
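For reference, a minimal sketch of that approach, assuming fastparquet's write(..., append=True) behaviour (the first call creates the file, each later call adds a row-group). Note that partitions are computed one at a time, so the write is sequential rather than parallel:

import numpy as np
import pandas as pd
import dask.dataframe as dd
import fastparquet

df = dd.from_pandas(
    pd.DataFrame(np.random.randn(1_000, 5), columns=list("abcde")),
    npartitions=4,
)

# Materialize one partition at a time and append it to one file.
for i, part in enumerate(df.to_delayed()):
    fastparquet.write(
        "single.parquet",
        part.compute(),
        append=(i > 0),  # first call creates the file; later calls add row-groups
    )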




Answer 2:


There are reasons to have multiple files (in particular, when a single big file doesn't fit in memory), but if you really need only one, you could try this:

import dask.dataframe as dd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(1_000, 5))

df = dd.from_pandas(df, npartitions=4)

# Collapsing to a single partition makes Dask write one part file
# (inside the "data" output directory) instead of one file per partition.
df.repartition(npartitions=1).to_parquet("data")
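If you need a single literal file rather than a directory containing one part file, a small post-processing step can move it out. A sketch (the part-file name varies across Dask versions, hence the glob):

import glob
import os
import shutil

# Continuing from the write above: "data" is still a directory,
# but it now holds exactly one part file.
[part_file] = glob.glob(os.path.join("data", "*.parquet"))
shutil.move(part_file, "single_file.parquet")
shutil.rmtree("data")  # remove the leftover directory (and any _metadata files)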


Source: https://stackoverflow.com/questions/61108376/force-dask-to-parquet-to-write-single-file
