Adding new column to existing DataFrame in Python pandas

你的背包 2020-11-22 01:15

I have the following indexed DataFrame with named columns and non-contiguous row numbers:

          a         b         c         d
2  0.671399  0.101208 -         


        
25 Answers
  • 2020-11-22 01:21

    To create an empty column

    df['i'] = None
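
    As a small sketch of what this placeholder looks like in practice (the toy frame and the second column name 'j' are assumptions for illustration): None fills every row and is usually stored with object dtype, while numpy.nan gives a float column, which may be preferable if numbers will be filled in later.

```python
import numpy as np
import pandas as pd

# Toy frame, made up for this example.
df = pd.DataFrame({"a": [1, 2, 3]})

df["i"] = None     # every row holds None (usually object dtype)
df["j"] = np.nan   # every row holds NaN (float64 dtype)

print(df.dtypes)
```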
    
  • 2020-11-22 01:22

    I was looking for a general way of adding a column of numpy.nans to a dataframe without getting the dumb SettingWithCopyWarning.

    From the following:

    • the answers here
    • this question about passing a variable as a keyword argument
    • this method for generating a numpy array of NaNs in-line

    I came up with this:

    col = 'column_name'
    df = df.assign(**{col:numpy.full(len(df), numpy.nan)})
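
    A self-contained version of the same idea, as a sketch on a made-up frame with non-contiguous row labels: the dict unpacking lets the column name come from a variable, and assign returns a new DataFrame rather than modifying the one it is called on.

```python
import numpy as np
import pandas as pd

# Toy frame with non-contiguous row labels, made up for this example.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[2, 5, 9])
col = "column_name"

# ** unpacks the dict so the variable's value becomes the keyword name.
new_df = df.assign(**{col: np.full(len(df), np.nan)})

print(list(df.columns))      # the original frame is left untouched
print(list(new_df.columns))  # the copy carries the extra column
```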
    
  • 2020-11-22 01:23

    Super simple column assignment

    A pandas dataframe is implemented as an ordered dict of columns.

    This means that the __getitem__ [] can not only be used to get a certain column, but __setitem__ [] = can be used to assign a new column.

    For example, a column can be added to this dataframe simply by using the [] accessor:

        size      name color
    0    big      rose   red
    1  small    violet  blue
    2  small     tulip   red
    3  small  harebell  blue
    
    df['protected'] = ['no', 'no', 'no', 'yes']
    
        size      name color protected
    0    big      rose   red        no
    1  small    violet  blue        no
    2  small     tulip   red        no
    3  small  harebell  blue       yes
    

    Note that this works even if the index of the dataframe is not the default 0, 1, 2, ..., because a plain list is assigned by position.

    df.index = [3,2,1,0]
    df['protected'] = ['no', 'no', 'no', 'yes']
        size      name color protected
    3    big      rose   red        no
    2  small    violet  blue        no
    1  small     tulip   red        no
    0  small  harebell  blue       yes
    

    []= is the way to go, but watch out!

    However, if you have a pd.Series and try to assign it to a dataframe where the indexes do not line up, you will run into trouble. See this example:

    df['protected'] = pd.Series(['no', 'no', 'no', 'yes'])
        size      name color protected
    3    big      rose   red       yes
    2  small    violet  blue        no
    1  small     tulip   red        no
    0  small  harebell  blue        no
    

    This is because a pd.Series by default has an index enumerated from 0 to n-1, and the pandas []= method tries to be "smart".

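    What is actually going on

    When you use the []= method, pandas quietly aligns the right-hand Series to the index of the left-hand DataFrame, much like a left join on the index. Values are matched by index label rather than by position: DataFrame labels missing from the Series get NaN, and Series labels missing from the DataFrame are discarded. This happens for any df['column'] = series.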
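
    To see the label matching in isolation, here is a minimal sketch on a made-up three-row frame (the column name 'tag' and the label 99 are assumptions for illustration). Values are placed by index label; a DataFrame label with no match gets NaN, and a Series label that does not exist in the DataFrame is dropped.

```python
import pandas as pd

# Toy data, made up for this example.
df = pd.DataFrame({"name": ["rose", "violet", "tulip"]}, index=[3, 2, 1])
s = pd.Series(["a", "b", "c"], index=[1, 2, 99])

df["tag"] = s  # matched on index labels, not on position

print(df)
# Row 3 has no partner in s and becomes NaN; label 99 never appears in df.
```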

    Side note

    This quickly causes cognitive dissonance, since the []= method tries to do a lot of different things depending on the input, and the outcome cannot be predicted unless you already know how pandas works. I would therefore advise against []= in code bases, but when exploring data in a notebook it is fine.

    Going around the problem

    If you have a pd.Series and want it assigned from top to bottom, or if you are writing production code and are not sure of the index order, it is worth safeguarding against this kind of issue.

    You could convert the pd.Series to a np.ndarray or a list, which discards its index. This will do the trick.

    df['protected'] = pd.Series(['no', 'no', 'no', 'yes']).values
    

    or

    df['protected'] = list(pd.Series(['no', 'no', 'no', 'yes']))
    

    But this is not very explicit.

    Some coder may come along and say "Hey, this looks redundant, I'll just optimize this away".
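
    If you do keep the positional conversion, a sketch like the following may make the intent harder to misread. Series.to_numpy() is the newer pandas spelling of .values; the length check and the comment are illustrative additions, not a required pattern.

```python
import pandas as pd

# Toy data, made up for this example.
df = pd.DataFrame(
    {"name": ["rose", "violet", "tulip", "harebell"]}, index=[3, 2, 1, 0]
)
protected_series = pd.Series(["no", "no", "no", "yes"])

# Deliberately ignore the Series index: assign top to bottom, by position.
assert len(protected_series) == len(df), "positional assignment needs equal lengths"
df["protected"] = protected_series.to_numpy()

print(df)
```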

    Explicit way

    Setting the index of the pd.Series to be the index of the df is explicit.

    df['protected'] = pd.Series(['no', 'no', 'no', 'yes'], index=df.index)
    

    Or more realistically, you probably have a pd.Series already available.

    protected_series = pd.Series(['no', 'no', 'no', 'yes'])
    protected_series.index = df.index
    
    3     no
    2     no
    1     no
    0    yes
    

    It can now be assigned:

    df['protected'] = protected_series
    
        size      name color protected
    3    big      rose   red        no
    2  small    violet  blue        no
    1  small     tulip   red        no
    0  small  harebell  blue       yes
    

    Alternative way with df.reset_index()

    Since the index mismatch is the problem, you can simply drop the index if you feel that the index of the dataframe should not dictate things. This should be faster, but it is not very clean, since your function now probably does two things.

    # reset_index returns a new object, so assign the result back
    df = df.reset_index(drop=True)
    protected_series = protected_series.reset_index(drop=True)
    df['protected'] = protected_series
    
        size      name color protected
    0    big      rose   red        no
    1  small    violet  blue        no
    2  small     tulip   red        no
    3  small  harebell  blue       yes
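
    One detail worth checking when using this approach: by default reset_index returns a new object and leaves the original alone, so the result has to be assigned back. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

# Toy frame, made up for this example.
df = pd.DataFrame({"name": ["rose", "violet"]}, index=[3, 2])

df.reset_index(drop=True)        # result is discarded, df keeps its index
unchanged = list(df.index)

df = df.reset_index(drop=True)   # rebinding the name keeps the new index
changed = list(df.index)

print(unchanged, changed)
```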
    

    Note on df.assign

    While df.assign makes it more explicit what you are doing, it actually has all the same problems as the []= above.

    df.assign(protected=pd.Series(['no', 'no', 'no', 'yes']))
        size      name color protected
    3    big      rose   red       yes
    2  small    violet  blue        no
    1  small     tulip   red        no
    0  small  harebell  blue        no
    

    Just watch out with df.assign that your column is not called self; it will cause errors. This makes df.assign smelly, since the function carries this kind of artifact.

    df.assign(self=pd.Series(['no', 'no', 'no', 'yes']))
    TypeError: assign() got multiple values for keyword argument 'self'
    

    You may say, "Well, I'll just not use self then". But who knows how this function changes in the future to support new arguments. Maybe your column name will be an argument in a new update of pandas, causing problems with upgrading.
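
    If a column name that clashes with an argument cannot be avoided, one possible workaround is to fall back to item assignment on a copy, where the name is only a key. This is a sketch under the assumption that the clash surfaces as a TypeError, as in the error shown above; the toy frame and values are made up.

```python
import pandas as pd

# Toy data, made up for this example.
df = pd.DataFrame({"size": ["big", "small"]})
values = ["no", "yes"]

try:
    # Dict unpacking still passes a keyword named "self", so on pandas
    # versions with the clash described above this raises TypeError as well.
    result = df.assign(**{"self": values})
except TypeError:
    # Item assignment on a copy: the column name is just a key here.
    result = df.copy()
    result["self"] = values

print(result)
```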

  • 2020-11-22 01:24

    If the data frame and Series object have the same index, pandas.concat also works here:

    import pandas as pd
    df
    #          a            b           c           d
    #0  0.671399     0.101208   -0.181532    0.241273
    #1  0.446172    -0.243316    0.051767    1.577318
    #2  0.614758     0.075793   -0.451460   -0.012493
    
    e = pd.Series([-0.335485, -1.166658, -0.385571])    
    e
    #0   -0.335485
    #1   -1.166658
    #2   -0.385571
    #dtype: float64
    
    # here we need to give the series object a name which converts to the new  column name 
    # in the result
    df = pd.concat([df, e.rename("e")], axis=1)
    df
    
    #          a            b           c           d           e
    #0  0.671399     0.101208   -0.181532    0.241273   -0.335485
    #1  0.446172    -0.243316    0.051767    1.577318   -1.166658
    #2  0.614758     0.075793   -0.451460   -0.012493   -0.385571
    

    In case they don't have the same index:

    e.index = df.index
    df = pd.concat([df, e.rename("e")], axis=1)
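
    To illustrate why the index matters here, the following sketch uses made-up values and row labels. With its default outer join, pd.concat along axis=1 keeps the union of the row labels, so mismatched indexes produce extra rows padded with NaN; overwriting the Series index first avoids that.

```python
import pandas as pd

# Toy data, made up for this example.
df = pd.DataFrame({"a": [0.67, 0.45, 0.61]}, index=[10, 20, 30])
e = pd.Series([-0.34, -1.17, -0.39], name="e")  # default index 0, 1, 2

# No label is shared, so the result has one row per label from either side.
misaligned = pd.concat([df, e], axis=1)

# After copying the index over, the rows pair up one to one.
e.index = df.index
aligned = pd.concat([df, e], axis=1)

print(misaligned.shape, aligned.shape)
```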
    
  • 2020-11-22 01:25

    To add a new column, 'e', to the existing data frame:

     # Pass index=df1.index so the values line up with df1's row labels
     df1.loc[:, 'e'] = pd.Series(np.random.randn(len(df1)), index=df1.index)
    
  • 2020-11-22 01:25

    This is a special case of adding a new column to a pandas dataframe. Here, I am adding a new feature/column based on the data in an existing column of the dataframe.

    Suppose our dataframe has the columns 'feature_1', 'feature_2' and 'probability_score', and we have to add a new column 'predicted_class' based on the data in the column 'probability_score'.

    I will use the map() method of the pandas Series and define a function of my own that implements the logic for assigning a class label to every row in my dataframe.

    import pandas as pd

    data = pd.read_csv('data.csv')

    def myFunction(x):
        # Implement your own logic here. The 0.5 cutoff and the 1/0 labels
        # are placeholders for illustration only.
        if x >= 0.5:
            return 1
        return 0

    # Apply the function to every value of the existing column
    variable_1 = data['probability_score']
    predicted_class = variable_1.map(myFunction)

    data['predicted_class'] = predicted_class

    # Check the dataframe: the new column is derived row by row from the existing one
    data.head()
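
    For a version that runs without a CSV file, here is a sketch on three made-up scores. The 0.5 threshold, the 1/0 labels and the helper name to_class are assumptions for illustration. For a simple cutoff like this, numpy.where gives the same labels without calling a Python function per row.

```python
import numpy as np
import pandas as pd

# Toy scores, made up for this example.
data = pd.DataFrame({"probability_score": [0.1, 0.5, 0.9]})

def to_class(score, threshold=0.5):
    """Hypothetical rule: label 1 at or above the threshold, else 0."""
    return 1 if score >= threshold else 0

# Row-by-row mapping, as in the answer above.
data["predicted_class"] = data["probability_score"].map(to_class)

# Vectorized equivalent of the same rule.
data["predicted_class_vec"] = np.where(data["probability_score"] >= 0.5, 1, 0)

print(data)
```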
    