Python Dask - vertical concatenation of 2 DataFrames

99封情书 提交于 2019-12-21 04:26:59

问题


I am trying to vertically concatenate two Dask DataFrames

I have the following Dask DataFrame:

d = [
    ['A','B','C','D','E','F'],
    [1, 4, 8, 1, 3, 5],
    [6, 6, 2, 2, 0, 0],
    [9, 4, 5, 0, 6, 35],
    [0, 1, 7, 10, 9, 4],
    [0, 7, 2, 6, 1, 2]
    ]
df = pd.DataFrame(d[1:], columns=d[0])
ddf = dd.from_pandas(df, npartitions=5)

Here is the data as a Pandas DataFrame

          A         B      C      D      E      F
0         1         4      8      1      3      5
1         6         6      2      2      0      0
2         9         4      5      0      6     35
3         0         1      7     10      9      4
4         0         7      2      6      1      2

Here is the Dask DataFrame

Dask DataFrame Structure:
                   A      B      C      D      E      F
npartitions=4                                          
0              int64  int64  int64  int64  int64  int64
1                ...    ...    ...    ...    ...    ...
2                ...    ...    ...    ...    ...    ...
3                ...    ...    ...    ...    ...    ...
4                ...    ...    ...    ...    ...    ...
Dask Name: from_pandas, 4 tasks

I am trying to concatenate 2 Dask DataFrames vertically:

ddf_i = ddf + 11.5
dd.concat([ddf,ddf_i],axis=0)

but I get this error:

Traceback (most recent call last):
      ...
      File "...", line 572, in concat
        raise ValueError('All inputs have known divisions which cannot '
    ValueError: All inputs have known divisions which cannot be concatenated
    in order. Specify interleave_partitions=True to ignore order

However, if I try:

dd.concat([ddf,ddf_i],axis=0,interleave_partitions=True)

then it appears to be working. Is there a problem with setting this to True (in terms of performance - speed)? Or is there another way to vertically 2 concatenate Dask DataFrames?


回答1:


If you inspect the divisions of the dataframe ddf.divisions, you will find, assuming one partition, that it has the edges of the index there: (0, 4). This is useful to dask, as it knows when you do some operation on the data, not to use a partition not including required index values. This is also why some dask operations are much faster when the index is appropriate for the job.

When you concatenate, the second dataframe has the same index as the first. Concatenation would work without interleaving if the values of the index had different ranges in the two partitions.



来源:https://stackoverflow.com/questions/43810905/python-dask-vertical-concatenation-of-2-dataframes

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!