Question
I'm trying to read and process a list of CSV files in parallel and concatenate the output into a single pandas dataframe for further processing.
My workflow consists of 3 steps:
1. Create a series of pandas dataframes by reading a list of CSV files (all with the same structure):

def loadcsv(filename):
    df = pd.read_csv(filename)
    return df

2. For each dataframe, create a new column by processing 2 existing columns:

def makegeom(a, b):
    return 'Point(%s %s)' % (a, b)

def applygeom(df):
    df['Geom'] = df.apply(lambda row: makegeom(row['Easting'], row['Northing']),
                          axis=1)
    return df

3. Concatenate all the dataframes into a single dataframe:

frames = []
for i in csvtest:
    df = applygeom(loadcsv(i))
    frames.append(df)
mergedresult1 = pd.concat(frames)
In my workflow I use pandas (each of the 15 CSV files has more than 2*10^6 rows), so it takes a while to complete. I think this kind of workflow should take advantage of some parallel processing (at least for the read_csv and apply steps), so I gave dask a try, but I was not able to use it properly: in my attempt I didn't gain any improvement in speed.
I made a simple notebook to replicate what I'm doing:
https://gist.github.com/epifanio/72a48ca970a4291b293851ad29eadb50
My question is ... what's the proper way to use dask to accomplish my use case?
Answer 1:
Pandas
In Pandas I would use the apply method
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 2, 1]})
In [3]: def makegeom(row):
   ...:     a, b = row
   ...:     return 'Point(%s %s)' % (a, b)
   ...:
In [4]: df.apply(makegeom, axis=1)
Out[4]:
0    Point(1 3)
1    Point(2 2)
2    Point(3 1)
dtype: object
Dask.dataframe
In dask.dataframe you can do the same thing
In [5]: import dask.dataframe as dd
In [6]: ddf = dd.from_pandas(df, npartitions=2)
In [7]: ddf.apply(makegeom, axis=1).compute()
Out[7]:
0    Point(1 3)
1    Point(2 2)
2    Point(3 1)
dtype: object
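Depending on your dask version, calling apply without an explicit meta argument emits a warning while dask samples the data to infer the output dtype. You can pass the metadata yourself; a minimal sketch (the series name 'geom' here is arbitrary, not required by dask):

ddf.apply(makegeom, axis=1, meta=('geom', 'object')).compute()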
Add new series
In either case you can then add the new series to the dataframe
df['geom'] = df[['a', 'b']].apply(makegeom, axis=1)
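The dask dataframe supports the same column assignment; a sketch under the assumptions above (makegeom takes a whole row, and meta names the new series):

ddf['geom'] = ddf[['a', 'b']].apply(makegeom, axis=1, meta=('geom', 'object'))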
Create
If you have CSV data then I would use the dask.dataframe.read_csv function
ddf = dd.read_csv('filenames.*.csv')
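Putting the pieces together for the question's workflow, a minimal end-to-end sketch (the glob 'data/*.csv' is a stand-in for your file list; the column selection keeps makegeom's two-value unpacking working):

import dask.dataframe as dd

ddf = dd.read_csv('data/*.csv')  # reads and parses the files in parallel
ddf['Geom'] = ddf[['Easting', 'Northing']].apply(makegeom, axis=1,
                                                 meta=('Geom', 'object'))
mergedresult1 = ddf.compute()    # one concatenated pandas dataframe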
If you have other kinds of data then I would use dask.delayed
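For example, you can wrap the question's per-file functions with dask.delayed and stitch the lazy pieces into one dask dataframe; a sketch assuming loadcsv, applygeom, and the file list csvtest from the question:

import dask
import dask.dataframe as dd

# build one lazy task per file: read it, then add the Geom column
parts = [dask.delayed(applygeom)(dask.delayed(loadcsv)(i)) for i in csvtest]
ddf = dd.from_delayed(parts)   # lazy dataframe over all files
mergedresult1 = ddf.compute()  # runs the reads/applies in parallel, then concatenates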
Source: https://stackoverflow.com/questions/40421508/read-process-and-concatenate-pandas-dataframe-in-parallel-with-dask