Python Pandas Multiprocessing Apply

Submitted anonymously (unverified) on 2019-12-03 08:46:08

Question:

I am wondering if there is a way to run a pandas DataFrame apply function in parallel. I have looked around and haven't found anything. In theory it should be fairly simple to implement; this is practically the textbook definition of an embarrassingly parallel problem. Has anyone else tried this or know of a way? If no one has any ideas, I think I might just try writing it myself.

The code I am working with is below. Sorry for the lack of import statements. They are mixed in with a lot of other things.

def apply_extract_entities(row):
    names=[]
    counter=0
    print row
    for sent in nltk.sent_tokenize(open(row['file_name'], "r+b").read()):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'node'):
                names+= [chunk.node, ' '.join(c[0] for c in chunk.leaves())]
                counter+=1
                print counter
    return names

data9_2['proper_nouns']=data9_2.apply(apply_extract_entities, axis=1)

EDIT:

So here is what I tried. I ran it with just the first five elements of my iterable, and it is taking longer than it would if I ran it serially, so I assume it is not working.

os.chdir(str(home))
data9_2=pd.read_csv('edgarsdc3.csv')
os.chdir(str(home)+str('//defmtest'))

#import stuff
from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer

#define apply function and apply it
os.chdir(str(home)+str('//defmtest'))

####
#this is our apply function
def apply_extract_entities(row):
    names=[]
    counter=0
    print row
    for sent in nltk.sent_tokenize(open(row['file_name'], "r+b").read()):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'node'):
                names+= [chunk.node, ' '.join(c[0] for c in chunk.leaves())]
                counter+=1
                print counter
    return names

#need something that populates a list of sections of a dataframe
def dataframe_splitter(df):
    df_list=range(len(df))
    for i in xrange(len(df)):
        sliced=df.ix[i]
        df_list[i]=sliced
    return df_list

df_list=dataframe_splitter(data9_2)
#df_list=range(len(data9_2))
print df_list

#the multiprocessing section
import multiprocessing

def worker(arg):
    print arg
    (arg)['proper_nouns']=arg.apply(apply_extract_entities, axis=1)
    return arg

pool = multiprocessing.Pool(processes=10)

# get list of pieces
res = pool.imap_unordered(worker, df_list[:5])
res2= list(itertools.chain(*res))
pool.close()
pool.join()

# re-assemble pieces into the final output
output = data9_2.head(1).concatenate(res)
print output.head()
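One thing worth noting about the attempt above: dataframe_splitter hands the pool one row at a time, so every task pays pickling and dispatch overhead for very little work. A minimal sketch of a chunk-based splitter, assuming the frame can simply be cut into a handful of positional slices, might look like this (the n_chunks value and the dummy frame are illustrative only):

import numpy as np
import pandas as pd

def dataframe_chunker(df, n_chunks=10):
    # cut the frame into a few large positional slices instead of single rows
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    return [df.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]

# illustrative dummy frame standing in for data9_2
example_df = pd.DataFrame({'file_name': ['a.txt', 'b.txt', 'c.txt', 'd.txt']})
print(len(dataframe_chunker(example_df, n_chunks=2)))  # 2 pieces, each still a DataFrame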

Answer 1:

With multiprocessing, it's best to generate several large blocks of data, then re-assemble them to produce the final output.

source

import multiprocessing

def worker(arg):
    return arg*2

pool = multiprocessing.Pool()

# get list of pieces
res = pool.map(worker, [1,2,3])
pool.close()
pool.join()

# re-assemble pieces into the final output
output = sum(res)
print 'got:',output

output

got: 12
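Applied to the question's DataFrame, that split / re-assemble pattern might look roughly like the sketch below. This is only an illustration under assumptions: extract_entities is a placeholder for the NLTK loop in the question, the dummy frame stands in for data9_2, and the worker is kept at module level so the pool can pickle it.

import multiprocessing

import pandas as pd

def extract_entities(file_name):
    # placeholder for the nltk.sent_tokenize / ne_chunk loop in the question
    return [file_name]

def worker(chunk):
    # each task receives a whole DataFrame block, not a single row
    chunk = chunk.copy()
    chunk['proper_nouns'] = chunk['file_name'].apply(extract_entities)
    return chunk

if __name__ == '__main__':
    # dummy frame standing in for data9_2
    data9_2 = pd.DataFrame({'file_name': ['a.txt', 'b.txt', 'c.txt', 'd.txt']})

    # get list of pieces: a few large blocks rather than one task per row
    mid = len(data9_2) // 2
    chunks = [data9_2.iloc[:mid], data9_2.iloc[mid:]]

    pool = multiprocessing.Pool(processes=2)
    pieces = pool.map(worker, chunks)
    pool.close()
    pool.join()

    # re-assemble pieces into the final output
    output = pd.concat(pieces)
    print(output.head())

Note that pool.map preserves the order of the input chunks, which is why the result can be re-assembled with a plain pd.concat; with imap_unordered the pieces would come back in arbitrary order.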

