Suppose that five files are loaded into Dask using read_csv. To do this, I use this code:
import dask.dataframe as dd
data = dd.read_csv(final_file_list_msg, header = None)
Every file has ten columns. I want to add 1 to the first column of file 1, 2 to the first column of file 2, 3 to the first column of file 3, etc.
Let's assume that you have several files laid out like this:
dummy/
├── file01.csv
├── file02.csv
├── file03.csv
First we create them via
import os
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask import delayed
fldr = "dummy"
if not os.path.exists(fldr):
    os.mkdir(fldr)
for i in range(10):
    df = pd.DataFrame(np.random.rand(5,3))
    df.to_csv("{}/file{:02}.csv".format(fldr,i+1),
              index=False)
The list of created files is then:
fns = sorted(os.listdir(fldr))
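As a side note on the sorted call: the plain lexicographic sort only matches the numeric order because the file names are zero-padded with {:02}. A small sketch of the difference, using made-up name lists rather than the real folder:

```python
# Hypothetical name lists, built in memory only for illustration.
padded = ["file{:02}.csv".format(i) for i in (10, 2, 1)]
unpadded = ["file{}.csv".format(i) for i in (10, 2, 1)]

# Zero-padded names sort in numeric order.
sorted_padded = sorted(padded)

# Unpadded names sort as strings, so file10 lands before file2.
sorted_unpadded = sorted(unpadded)

print(sorted_padded)
print(sorted_unpadded)
```

If the real file names are not zero-padded, sorting on the extracted integer instead of the raw string would be the safer choice.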
Then we write a function that, given the path fn:
- reads the file
- takes the number XX in fileXX.csv
- inserts int(XX) as the first column
That is:
def addCol(fn):
    df = pd.read_csv(os.path.join(fldr, fn))
    first = int(fn.split(".")[0][-2:])
    df.insert(0, "first", first)
    return df
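Note that the slice [-2:] assumes there are exactly two digits right before the extension. If that might not hold for your files, a slightly more defensive way to pull out the number could look like the sketch below. The helper name file_number is made up for this example and is not part of pandas or Dask.

```python
import os
import re


def file_number(fn):
    """Return the trailing integer of a name such as 'file03.csv'."""
    # Drop any folder part and the extension: 'dummy/file03.csv' -> 'file03'
    stem = os.path.splitext(os.path.basename(fn))[0]
    # Take every digit at the end of the stem, however many there are
    match = re.search(r"(\d+)$", stem)
    if match is None:
        raise ValueError("no trailing number in {!r}".format(fn))
    return int(match.group(1))


print(file_number("file03.csv"))
```

Inside addCol this would replace the int(fn.split(".")[0][-2:]) line, so that names like file123.csv are also handled.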
We want this function to be delayed, which we can achieve either with the @delayed decorator or by wrapping the function with delayed. To obtain the desired output, run one of the following, depending on which approach was used:
ddf = dd.from_delayed([addCol(fn) for fn in fns])  # if addCol is decorated with @delayed
ddf = dd.from_delayed([delayed(addCol)(fn) for fn in fns])  # if addCol is a plain function
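The function above inserts the file number as a new first column. If you instead want to add the number to the values already in the first column, as the question describes, the per-file step could look roughly like this. It is only a sketch: add_to_first_column is a hypothetical helper, and the small in-memory CSV stands in for one of the headerless files.

```python
import io

import pandas as pd


def add_to_first_column(df, value):
    """Return a copy of df with `value` added to its first column."""
    out = df.copy()
    # Select the first column by position, so column labels do not matter
    out.iloc[:, 0] = out.iloc[:, 0] + value
    return out


# Stand-in for one headerless CSV file (assumed contents)
csv_text = "1.0,10.0\n2.0,20.0\n"
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Pretend this frame came from file03.csv, so 3 is added
result = add_to_first_column(df, 3)
print(result)
```

This step can be called inside a delayed function in the same way as addCol, with the value taken from the file name.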
Source: https://stackoverflow.com/questions/54872997/add-a-value-to-a-column-of-dask-data-frames-imported-using-csv-read