Add a value to a column of Dask DataFrames imported using read_csv

倖福魔咒の posted on 2019-12-01 12:38:01

Let's assume that you have several files laid out as follows:

dummy/
├── file01.csv
├── file02.csv
├── file03.csv

First, we create them (the loop below writes ten files, file01.csv through file10.csv):

import os
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask import delayed

fldr = "dummy"

if not os.path.exists(fldr):
    os.mkdir(fldr)

for i in range(10):
    df = pd.DataFrame(np.random.rand(5,3))
    df.to_csv("{}/file{:02}.csv".format(fldr,i+1),
              index=False)

The list of files created is fns = sorted(os.listdir(fldr))

Then we write a function that, given the file name fn:

  • reads the file
  • extracts the number XX from fileXX.csv
  • inserts int(XX) as the first column

That is:

def addCol(fn):
    df = pd.read_csv(os.path.join(fldr, fn))
    first = int(fn.split(".")[0][-2:])
    df.insert(0, "first", first)
    return df
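Note that the slice [-2:] assumes the name ends in exactly two digits. If that may not hold, the number can be extracted more tolerantly. The following is only a minimal sketch: file_number is a hypothetical helper that is not part of the original answer, and it assumes the number always sits at the very end of the name, just before the extension.

```python
import os
import re


def file_number(fn):
    """Return the trailing integer of a name such as 'file07.csv'.

    Hypothetical helper: assumes the digits come right before the extension.
    """
    # Drop any folder part and the extension: "dummy/file07.csv" -> "file07"
    stem = os.path.splitext(os.path.basename(fn))[0]
    match = re.search(r"(\d+)$", stem)
    if match is None:
        raise ValueError("no trailing number in {!r}".format(fn))
    return int(match.group(1))


print(file_number("file01.csv"))
print(file_number("dummy/file123.csv"))
```

Inside addCol, int(fn.split(".")[0][-2:]) could then be swapped for file_number(fn) if the file names are not guaranteed to carry two digits.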

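We want this function to be delayed, which we can achieve either by decorating it with @delayed or by wrapping it with delayed(...). To obtain the desired output, use one of the following lines: the first applies when addCol is decorated with @delayed, the second when it is a plain function.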

  • ddf = dd.from_delayed([addCol(fn) for fn in fns])
  • ddf = dd.from_delayed([delayed(addCol)(fn) for fn in fns])