pandas: iterating over DataFrame index with loc

星月不相逢 2020-12-16 03:28

I can't seem to find the reasoning behind the behaviour of .loc. I know it is label-based, so if I iterate over an Index object, the following minimal example should work. But

2 answers
  • 2020-12-16 04:17

    The problem is not in df.loc itself. In the pandas version discussed here, df.loc[idx, 'Weekday'] first extracts row idx as a Series and then looks up 'Weekday' in it. The surprising behavior is due to the way pd.Series tries to cast datetime-like values to Timestamps.

    df.loc[0, 'Weekday']
    

    forms the Series

    pd.Series(np.array([pd.Timestamp('2014-01-01 00:00:00'), 'WED'], dtype=object))
    

    When pd.Series(...) is called, it tries to cast the data to an appropriate dtype.

    If you trace through the code, you'll find that it eventually arrives at these lines in pandas.core.common._possibly_infer_to_datetimelike:

    sample = v[:min(3,len(v))]
    inferred_type = lib.infer_dtype(sample)
    

    which is inspecting the first few elements of the data and trying to infer the dtype. When one of the values is a pd.Timestamp, Pandas checks to see if all the data can be cast as Timestamps. Indeed, 'Wed' can be cast to pd.Timestamp:

    In [138]: pd.Timestamp('Wed')
    Out[138]: Timestamp('2014-12-17 00:00:00')
    

    This is the root of the problem, which results in pd.Series returning two Timestamps instead of a Timestamp and a string:

    In [139]: pd.Series(np.array([pd.Timestamp('2014-01-01 00:00:00'), 'WED'], dtype=object))
    Out[139]: 
    0   2014-01-01
    1   2014-12-17
    dtype: datetime64[ns]
    

    and thus this returns

    In [140]: df.loc[0, 'Weekday']
    Out[140]: Timestamp('2014-12-17 00:00:00')
    

    instead of 'WED'.
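The parsing step described above can be reproduced in isolation. This is only a sketch that calls dateutil directly (it is installed alongside pandas); it assumes that the string parsing pandas falls back on behaves equivalently, and it is not pandas' internal code path. The name `assumed_today` is hypothetical and pins "today" to the date on which the output above was produced.

```python
from datetime import datetime

from dateutil import parser  # third-party; a dependency of pandas

# Assumption: fix "today" to the day the answer's output was generated.
assumed_today = datetime(2014, 12, 17)

# A bare weekday name is accepted as a date string: missing fields are
# taken from `default`, then the date is moved forward (if necessary)
# to the named weekday.
parsed = parser.parse('WED', default=assumed_today)

# Starting two days earlier (a Monday) lands on the same Wednesday.
parsed_from_monday = parser.parse('WED', default=datetime(2014, 12, 15))

print(parsed, parsed_from_monday)
```

Because a weekday abbreviation parses without error, a check of the form "can every value be converted to a datetime?" succeeds for a row that mixes a Timestamp with 'WED'.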


    Alternative: select the Series df['Weekday'] first:

    There are many workarounds; EdChum shows that adding a non-datelike (integer) value to the sample can prevent pd.Series from casting all the values to Timestamps.

    Alternatively, you could access df['Weekday'] before using .loc:

    for idx in df.index:
        print(df['Weekday'].loc[idx])
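A self-contained version of this workaround might look like the following. The question's own minimal example is not shown here, so the frame is an assumed reconstruction based on the `dict_weekday` mapping and the `apply` call used in the other answer.

```python
import pandas as pd

dict_weekday = {1: 'MON', 2: 'TUE', 3: 'WED', 4: 'THU',
                5: 'FRI', 6: 'SAT', 7: 'SUN'}

# Assumed reconstruction of the question's DataFrame.
df = pd.DataFrame({'Date': pd.date_range('2014-01-01', '2014-01-15', freq='D')})
df['Weekday'] = df['Date'].apply(lambda d: dict_weekday[d.isoweekday()])

# Select the column first, so the lookup only ever touches one column
# and no mixed-type row Series is built.
weekday_col = df['Weekday']
labels = [weekday_col.loc[idx] for idx in df.index]

print(labels[:3])
```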
    

    Alternative: df.loc[[idx], 'Weekday']:

    Another alternative is

    for idx in df.index:
        print(df.loc[[idx], 'Weekday'].item())
    

    df.loc[[idx], 'Weekday'] first selects the DataFrame df.loc[[idx]]. For example, when idx equals 0,

    In [10]: df.loc[[0]]
    Out[10]: 
            Date Weekday
    0 2014-01-01     WED
    

    whereas df.loc[0] returns the Series:

    In [11]: df.loc[0]
    Out[11]: 
    Date      2014-01-01
    Weekday   2014-12-17
    Name: 0, dtype: datetime64[ns]
    

    A Series tries to cast its values to a single useful dtype, whereas a DataFrame can have a different dtype for each column. So the Timestamp in the Date column does not affect the dtype of the value in the Weekday column.

    So the problem was avoided by using an index selector which returns a DataFrame.
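The difference between the two selectors can be checked directly. The sketch below again uses an assumed reconstruction of the question's frame; it only verifies that the one-row selection is a DataFrame whose columns keep separate dtypes.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

dict_weekday = {1: 'MON', 2: 'TUE', 3: 'WED', 4: 'THU',
                5: 'FRI', 6: 'SAT', 7: 'SUN'}

# Assumed reconstruction of the question's DataFrame.
df = pd.DataFrame({'Date': pd.date_range('2014-01-01', '2014-01-15', freq='D')})
df['Weekday'] = df['Date'].apply(lambda d: dict_weekday[d.isoweekday()])

# A list selector keeps the result two-dimensional.
row_frame = df.loc[[0]]

# Each column of the one-row frame keeps its own dtype.
date_is_datetime = is_datetime64_any_dtype(row_frame['Date'])
weekday_is_datetime = is_datetime64_any_dtype(row_frame['Weekday'])

# .item() extracts the single element of the length-1 Series.
value = df.loc[[0], 'Weekday'].item()

print(type(row_frame).__name__, date_is_datetime, weekday_is_datetime, value)
```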


    Alternative: use integers for Weekday

    Yet another alternative is to store the isoweekday integer in Weekday, and convert to strings only at the end when you print:

    import datetime
    import pandas as pd
    
    dict_weekday = {1: 'MON', 2: 'TUE', 3: 'WED', 4: 'THU', 5: 'FRI', 6: 'SAT', 7: 'SUN'}
    df = pd.DataFrame(
        pd.date_range(datetime.date(2014, 1, 1), datetime.date(2014, 1, 15), freq='D'),
        columns=['Date'])
    df['Weekday'] = df['Date'].dt.weekday+1   # add 1 for isoweekday
    
    for idx in df.index:
        print(dict_weekday[df.loc[idx, 'Weekday']])
    

    Alternative: use df.ix:

    df.loc is a _LocIndexer, whereas df.ix is a _IXIndexer. They have different __getitem__ methods. If you step through the code (for example, using pdb) you'll find that df.ix calls df.get_value:

    def __getitem__(self, key):
        if type(key) is tuple:
            try:
                values = self.obj.get_value(*key)
    

    and the DataFrame method df.get_value succeeds in returning 'WED':

    In [14]: df.get_value(0, 'Weekday')
    Out[14]: 'WED'
    

    This is why df.ix is another alternative that works here.
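Both df.ix and DataFrame.get_value were removed in pandas 1.0, so this alternative only applies to older installations. On current versions, the label-based scalar accessor df.at plays the same role of fetching a single cell without building a row Series first. A minimal sketch, using an assumed reconstruction of the question's frame:

```python
import pandas as pd

dict_weekday = {1: 'MON', 2: 'TUE', 3: 'WED', 4: 'THU',
                5: 'FRI', 6: 'SAT', 7: 'SUN'}

# Assumed reconstruction of the question's DataFrame.
df = pd.DataFrame({'Date': pd.date_range('2014-01-01', '2014-01-15', freq='D')})
df['Weekday'] = df['Date'].apply(lambda d: dict_weekday[d.isoweekday()])

# .at takes a row label and a column label and returns one scalar.
value = df.at[0, 'Weekday']

# The same accessor inside the loop from the question.
labels = [df.at[idx, 'Weekday'] for idx in df.index]

print(value, labels[-1])
```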

  • 2020-12-16 04:30

    This seems like a bug to me. For reference, I am using Python 3.3.5 64-bit, pandas 0.15.1 and numpy 1.9.1.

    Your code shows that, although the values print as strings, the value returned is a Timestamp:

    In [56]:
    
    df.iloc[0]['Weekday']
    Out[56]:
    Timestamp('2014-12-17 00:00:00')
    

    If I do the following then it stays as a string:

    In [58]:
    
    df['Weekday'] = df['Date'].apply(lambda x: dict_weekday[x.isoweekday()])
    df['WeekdayInt'] = df['Date'].map(lambda x: x.isoweekday())
    df.iloc[0]['Weekday']
    Out[58]:
    'WED'
    

    The above is odd as all I did was add a second column.

    Similarly, if I first create a column to store the integer day value and then perform the apply, it also works:

    In [60]:
    
    df['WeekdayInt'] = df['Date'].map(lambda x: x.isoweekday())
    df['Weekday'] = df['WeekdayInt'].apply(lambda x: dict_weekday[x])
    df.iloc[0]['Weekday']
    Out[60]:
    'WED'
    

    It looks like the dtype is somehow persisting or not being assigned correctly if it's the first column appended.
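One way to tell whether the stored column or the row extraction is responsible is to inspect the column on its own. The sketch below uses the same construction as above and makes no claim about what df.iloc[0]['Weekday'] returns, since that depends on the pandas version; it only checks what the column actually holds.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

dict_weekday = {1: 'MON', 2: 'TUE', 3: 'WED', 4: 'THU',
                5: 'FRI', 6: 'SAT', 7: 'SUN'}

df = pd.DataFrame({'Date': pd.date_range('2014-01-01', '2014-01-15', freq='D')})
df['Weekday'] = df['Date'].apply(lambda x: dict_weekday[x.isoweekday()])

# Column dtypes as stored in the frame.
print(df.dtypes)

# Read the first value through the column, bypassing any row Series.
stored = df['Weekday'].iloc[0]
column_is_datetime = is_datetime64_any_dtype(df['Weekday'])

print(repr(stored), column_is_datetime)
```

If the column reports a non-datetime dtype and the value read through it is a string, the stored data is intact and any Timestamp seen comes from how the row was extracted.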
