Concat DataFrame Reindexing only valid with uniquely valued Index objects

后端 未结 4 498
余生分开走
余生分开走 2020-12-10 11:36

I am trying to concat the following:

df1

    price   side    timestamp
timestamp           
2016-01-04 00:01:15.631331072   0.7286  2   1451865675631         


        
相关标签:
4条回答
  • 2020-12-10 11:57

    pd.concat requires that the indices be unique. To remove rows with duplicate indices, use

    df = df.loc[~df.index.duplicated(keep='first')]
    

    import pandas as pd
    from pandas import Timestamp
    
    df1 = pd.DataFrame(
        {'price': [0.7286, 0.7286, 0.7286, 0.7286],
         'side': [2, 2, 2, 2],
         'timestamp': [1451865675631331, 1451865675631400,
                      1451865675631861, 1451865675631866]},
        index=pd.DatetimeIndex(['2000-1-1', '2000-1-1', '2001-1-1', '2002-1-1']))
    
    
    df2 = pd.DataFrame(
        {'bid': [0.7284, 0.7284, 0.7284, 0.7285, 0.7285],
         'bid_size': [4000000, 4000000, 5000000, 1000000, 4000000],
         'offer': [0.7285, 0.729, 0.7286, 0.7286, 0.729],
         'offer_size': [1000000, 4000000, 4000000, 4000000, 4000000]},
        index=pd.DatetimeIndex(['2000-1-1', '2001-1-1', '2002-1-1', '2003-1-1', '2004-1-1']))
    
    
    df1 = df1.loc[~df1.index.duplicated(keep='first')]
    #              price  side         timestamp
    # 2000-01-01  0.7286     2  1451865675631331
    # 2001-01-01  0.7286     2  1451865675631861
    # 2002-01-01  0.7286     2  1451865675631866
    
    df2 = df2.loc[~df2.index.duplicated(keep='first')]
    #                bid  bid_size   offer  offer_size
    # 2000-01-01  0.7284   4000000  0.7285     1000000
    # 2001-01-01  0.7284   4000000  0.7290     4000000
    # 2002-01-01  0.7284   5000000  0.7286     4000000
    # 2003-01-01  0.7285   1000000  0.7286     4000000
    # 2004-01-01  0.7285   4000000  0.7290     4000000
    
    result = pd.concat([df1, df2], axis=0)
    print(result)
                   bid  bid_size   offer  offer_size   price  side     timestamp
    2000-01-01     NaN       NaN     NaN         NaN  0.7286     2  1.451866e+15
    2001-01-01     NaN       NaN     NaN         NaN  0.7286     2  1.451866e+15
    2002-01-01     NaN       NaN     NaN         NaN  0.7286     2  1.451866e+15
    2000-01-01  0.7284   4000000  0.7285     1000000     NaN   NaN           NaN
    2001-01-01  0.7284   4000000  0.7290     4000000     NaN   NaN           NaN
    2002-01-01  0.7284   5000000  0.7286     4000000     NaN   NaN           NaN
    2003-01-01  0.7285   1000000  0.7286     4000000     NaN   NaN           NaN
    2004-01-01  0.7285   4000000  0.7290     4000000     NaN   NaN           NaN
    

    Note there is also pd.join, which can join DataFrames based on their indices, and handle non-unique indices based on the how parameter. Rows with duplicate index are not removed.

    In [94]: df1.join(df2)
    Out[94]: 
                 price  side         timestamp     bid  bid_size   offer  \
    2000-01-01  0.7286     2  1451865675631331  0.7284   4000000  0.7285   
    2000-01-01  0.7286     2  1451865675631400  0.7284   4000000  0.7285   
    2001-01-01  0.7286     2  1451865675631861  0.7284   4000000  0.7290   
    2002-01-01  0.7286     2  1451865675631866  0.7284   5000000  0.7286   
    
                offer_size  
    2000-01-01     1000000  
    2000-01-01     1000000  
    2001-01-01     4000000  
    2002-01-01     4000000  
    
    In [95]: df1.join(df2, how='outer')
    Out[95]: 
                 price  side     timestamp     bid  bid_size   offer  offer_size
    2000-01-01  0.7286     2  1.451866e+15  0.7284   4000000  0.7285     1000000
    2000-01-01  0.7286     2  1.451866e+15  0.7284   4000000  0.7285     1000000
    2001-01-01  0.7286     2  1.451866e+15  0.7284   4000000  0.7290     4000000
    2002-01-01  0.7286     2  1.451866e+15  0.7284   5000000  0.7286     4000000
    2003-01-01     NaN   NaN           NaN  0.7285   1000000  0.7286     4000000
    2004-01-01     NaN   NaN           NaN  0.7285   4000000  0.7290     4000000
    
    0 讨论(0)
  • 2020-12-10 11:58

    You can mitigate this error without having to change your data or remove duplicates. Just create a new index with DataFrame.reset_index:

    df = df.reset_index()
    

    The old index is kept as a column in your dataframe, but if you don't need it you can do:

    df = df.reset_index(drop=True)
    

    Some prefer:

    df.reset_index(inplace=True, drop=True)
    
    0 讨论(0)
  • 2020-12-10 12:19

    best solution from this page: https://pandas.pydata.org/pandas-docs/version/0.20/merging.html

    df = pd.concat([df1, df2], axis=1, join_axes=[df1.index])
    
    0 讨论(0)
  • 2020-12-10 12:19

    Another thing that might throw this type of errors is when you have a column with a unique value inside (entropy of 0). In this case, you can either inject small Gaussian noise in that column or remove it completely.

    0 讨论(0)
提交回复
热议问题