Question
What I want to achieve, in terms of input and output, is a cross join.
Input Example:
import pandas as pd

df = pd.DataFrame(columns=['A', 'val'], data=[['a1', 23], ['a2', 29], ['a3', 39]])
print(df)
    A  val
0  a1   23
1  a2   29
2  a3   39
Output Example:
df['key'] = 1
df.merge(df, how="outer", on="key")
  A_x  val_x  key A_y  val_y
0  a1     23    1  a1     23
1  a1     23    1  a2     29
2  a1     23    1  a3     39
3  a2     29    1  a1     23
4  a2     29    1  a2     29
5  a2     29    1  a3     39
6  a3     39    1  a1     23
7  a3     39    1  a2     29
8  a3     39    1  a3     39
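As an aside (not in the original post): on pandas >= 1.2 the dummy key is unnecessary, because merge supports a cross join directly:

df.merge(df, how="cross")   # same Cartesian product, no key column needed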
How do I achieve this for a large dataset with Dask?
I am interested in getting all row-pair combinations of a Dask DataFrame (similar to a Cartesian product) in order to compute inter-row metrics such as distances. However, I always get a memory error when running Dask Distributed locally. I have provided a toy example below of what I am trying to achieve.
I am new to Dask, so I just want to know: is this even possible locally? What should my ideal partition size be? And what is a better way to get row pairs with Dask? (A sketch of one alternative appears at the end of this post.)
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask.distributed import Client

client = Client()
client   # in a notebook, this displays the cluster summary shown below

df = pd.DataFrame(columns=list(range(50)), data=np.random.rand(10000, 50))
ddf = dd.from_pandas(df, npartitions=10)
ddf = ddf.assign(key=0)   # constant key so every row matches every row
ddf = dd.merge(ddf, ddf, suffixes=('', '_ch'), on='key',
               npartitions=10000, how='outer')
ddf[0].mean().compute()   # columns are ints, so ddf[0] rather than ddf['0']
I get the following error:
MemoryError: Unable to allocate 37.3 GiB for an array with shape (100000000, 50) and data type float64
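That figure is just the cross product itself: 10,000 × 10,000 = 100,000,000 rows times 50 float64 columns times 8 bytes is about 37.25 GiB for one suffixed side alone, and the merged frame carries both sides plus the key, so it cannot fit in the 34.10 GB cluster listed below.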
Local Cluster Details
Scheduler: tcp://127.0.0.1:52435
Dashboard: http://127.0.0.1:8787/status
Cluster
Workers: 4
Cores: 12
Memory: 34.10 GB
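For reference, a minimal sketch of one alternative (not from the original post): instead of materializing the full cross join, iterate over pairs of partitions with dask.delayed and reduce each block to the needed statistic immediately, so only one partition-pair block lives in memory per task. The Euclidean-distance metric and the final mean here are illustrative assumptions.

import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd

df = pd.DataFrame(np.random.rand(10000, 50), columns=list(range(50)))
ddf = dd.from_pandas(df, npartitions=10)
parts = ddf.to_delayed()   # one lazy pandas DataFrame per partition

@dask.delayed
def block_metric(left, right):
    # Reduce each partition pair on the spot instead of keeping the
    # cross-joined rows around; here: mean pairwise Euclidean distance.
    diff = left.to_numpy()[:, None, :] - right.to_numpy()[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).mean()

blocks = [block_metric(a, b) for a in parts for b in parts]   # 10 * 10 = 100 small tasks
result = np.mean(dask.compute(*blocks))   # partitions are equal-sized, so a plain mean is exact
print(result)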
Source: https://stackoverflow.com/questions/62839389/dask-dataframe-effecient-row-pair-generator