pyspark's flatMap in pandas

Asked by 南旧 on 2021-02-04 12:06 · 3 answers · 1297 views

Is there an operation in pandas that does the same as flatMap in pyspark?

flatMap example:

>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]

3 Answers
  •  自闭症患者
    2021-02-04 12:59

    I suspect that the answer is "no, not efficiently."

    Pandas isn't built for nested data like this. I suspect that the case you're considering in Pandas looks a bit like the following:

    In [1]: import pandas as pd
    
    In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
    
    In [3]: df
    Out[3]: 
               x
    0     [1, 2]
    1  [3, 4, 5]
    

    and that you want something like the following:

        x
    0   1
    0   2
    1   3
    1   4
    1   5
    

    It is far more typical to normalize your data in Python before you send it to Pandas. If Pandas did do this then it would probably only be able to operate at slow Python speeds rather than fast C speeds.
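    As a sketch of that pre-normalization step, the nested lists can be flattened in plain Python (here with `itertools.chain`) before the data ever reaches Pandas:

    ```python
    import itertools
    import pandas as pd

    # The same nested data as in the frame above
    nested = [[1, 2], [3, 4, 5]]

    # Flatten in plain Python first...
    flat = list(itertools.chain.from_iterable(nested))

    # ...then hand the already-flat list to Pandas
    df = pd.DataFrame({'x': flat})
    ```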

    Generally one does a bit of munging of data before one uses tabular computation.
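    Note that since this answer was written, Pandas (0.25 and later) gained `DataFrame.explode`, which unnests list-like column values onto their own rows and produces essentially the flattened frame shown above, repeated index included:

    ```python
    import pandas as pd

    # Same nested frame as in the example above
    df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

    # explode() puts each list element on its own row,
    # repeating the original index -- much like flatMap
    flat = df.explode('x')
    ```

    It still iterates at Python speed over the nested values, so the general advice above about normalizing before Pandas remains sound for large data.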
