Iterate over pandas dataframe columns containing nested arrays

情话喂你 2021-01-21 01:46

I hope you can help me with this issue.

I have the data below (the column names don't matter):

data=([['file0090',
    ([[ 84,  55, 189],
   [248, 100,  18],
         


        
4 Answers
  •  野性不改
    2021-01-21 02:14

    You can write a custom generator function that reshapes the data into the required form.

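    from itertools import chain

    import pandas as pd

    def transform(d):
        # Each record is [name, rows]: x collects the leading fields, y the nested rows.
        for l in d:
            *x, y = l
            # Prepend the leading fields to every inner row.
            yield list(map(lambda s: x+s, y))

    # Flatten the per-record row lists into one list of rows.
    df = pd.DataFrame(chain(*transform(data)))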
    df
              0    1    2    3
    0  file0090   84   55  189
    1  file0090  248  100   18
    2  file0090   68  115   88
    3  file6565   86   58  189
    4  file6565   24   10  118
    5  file6565   68   11    8
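
For anyone who wants to run the approach end to end, here is a minimal self-contained sketch. The sample `data` is an assumption: it is reconstructed from the printed output above, since the data in the question is cut off. The column names `file`, `a`, `b`, `c` are hypothetical placeholders.

```python
from itertools import chain

import pandas as pd

# Assumed sample data, reconstructed from the printed output above:
# each record is [file_name, list_of_rows].
data = [
    ["file0090", [[84, 55, 189], [248, 100, 18], [68, 115, 88]]],
    ["file6565", [[86, 58, 189], [24, 10, 118], [68, 11, 8]]],
]


def transform(records):
    """Yield, for each record, its rows with the leading fields prepended."""
    for record in records:
        *prefix, rows = record
        yield [prefix + row for row in rows]


# Flatten the per-record lists into one list of rows before building the frame.
flat_rows = list(chain.from_iterable(transform(data)))

# Hypothetical column names; the question says the names do not matter.
df = pd.DataFrame(flat_rows, columns=["file", "a", "b", "c"])
print(df)
```

`chain.from_iterable` is used here instead of `chain(*...)`; both flatten one level, but the former does not unpack the generator into arguments first.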
    

    Timeit results of all the solutions:

    # YOBEN_S's answer
    In [275]: %%timeit
         ...: s = pd.DataFrame(data).set_index(0)[1].explode()
         ...: df = pd.DataFrame(s.tolist(), index = s.index.values)
         ...:
         ...:
    1.52 ms ± 59.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    # Anky's answer
    In [276]: %%timeit
         ...: df = pd.DataFrame(data).add_prefix('col')
         ...: out = df.explode('col1').reset_index(drop=True)
         ...: out = out.join(pd.DataFrame(out.pop('col1').tolist()).add_prefix('col_'))
         ...:
         ...:
    3.71 ms ± 606 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # Dhaval's answer
    In [277]: %%timeit
         ...: data_f = []
         ...: for i in data:
         ...:     for j in i[1]:
         ...:         data_f.append([i[0]]+j)
         ...: df = pd.DataFrame(data_f, columns =['col0','col1','col2','col3'])
         ...:
         ...:
    712 µs ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    # My answer
    In [280]: %%timeit
         ...: pd.DataFrame(chain(*transform(data)))
         ...:
         ...:
    489 µs ± 8.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    # List-comprehension version of Dhaval's answer
    
    In [306]: %%timeit
         ...: data_f = [[i[0]]+j for i in data for j in i[1]]
         ...: df = pd.DataFrame(data_f, columns =['col0','col1','col2','col3'])
         ...:
         ...:
    586 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    # Anky's 2nd solution
    
    In [308]: %%timeit
         ...: l = [*chain.from_iterable(data)]
         ...: pd.DataFrame(np.vstack(l[1::2]),index = np.repeat(l[::2],len(l[1])))
         ...:
         ...:
    221 µs ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
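
One caveat about the fastest timing above: `np.repeat(l[::2], len(l[1]))` uses a single repeat count, so it relies on every file having the same number of rows. The sketch below is an assumption-laden variant, not a benchmarked one. It passes per-record counts to `np.repeat` and checks that the result holds the same values as the list-comprehension approach. The sample `data` is reconstructed from the output shown earlier in this answer.

```python
import numpy as np
import pandas as pd

# Assumed sample data with the same [name, rows] layout as in the question.
data = [
    ["file0090", [[84, 55, 189], [248, 100, 18], [68, 115, 88]]],
    ["file6565", [[86, 58, 189], [24, 10, 118], [68, 11, 8]]],
]

# List-comprehension approach from the timings above (file name in column 0).
by_listcomp = pd.DataFrame([[name] + row for name, rows in data for row in rows])

# NumPy approach, with one repeat count per record instead of a single length,
# so records with differing numbers of rows are also handled.
names = [name for name, _ in data]
blocks = [np.asarray(rows) for _, rows in data]
counts = [len(block) for block in blocks]
by_numpy = pd.DataFrame(np.vstack(blocks), index=np.repeat(names, counts))

# Both hold the same numbers; they differ only in where the file name lives
# (column 0 versus the index).
same_values = np.array_equal(
    by_listcomp.iloc[:, 1:].to_numpy(dtype=int), by_numpy.to_numpy()
)
same_names = by_listcomp[0].tolist() == by_numpy.index.tolist()
print(same_values, same_names)
```

Note that the NumPy variant puts the file name in the index, not in a column; call `reset_index()` if a regular column is needed.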
    
