Why do we use 'loc' for pandas DataFrames? It seems the following code runs at a similar speed with or without loc:
%timeit df_user1 = df.loc[df.user_id=='5561']
100 loops, best of 3: 11.9 ms per loop
or
%timeit df_user1_noloc = df[df.user_id=='5561']
100 loops, best of 3: 12 ms per loop
So why use loc?
Edit: This has been flagged as a duplicate question. But although "pandas iloc vs ix vs loc explanation?" does mention that "you can do column retrieval just by using the data frame's getitem:"
df['time'] # equivalent to df.loc[:, 'time']
it does not say why we use loc. It does explain lots of features of loc, but my specific question is 'why not just omit loc altogether?', for which I have accepted a very detailed answer below.
Also, in that other post the answer (which I do not think is an answer) is hidden deep in the discussion. Anyone searching for what I was looking for would find it hard to locate and would be much better served by the answer provided to my question.
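For reference, the premise of the question can be checked directly. The snippet below is only a sketch: the question's real df is not shown, so the small frame here (with hypothetical user_id and time columns) is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the question's df, which is not shown.
df = pd.DataFrame({
    "user_id": ["5561", "1234", "5561"],
    "time": [10, 20, 30],
})

# Row selection with a boolean mask, with and without loc.
mask = df.user_id == "5561"
with_loc = df.loc[mask]
without_loc = df[mask]
rows_match = with_loc.equals(without_loc)

# Column retrieval through getitem versus loc.
cols_match = df["time"].equals(df.loc[:, "time"])

print(rows_match, cols_match)
```

For an ordinary frame like this one both forms return the same data, which is why the timings look alike. The differences only show up in the cases the answer describes.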
Explicit is better than implicit.
df[boolean_mask] selects rows where boolean_mask is True, but there is a corner case when you might not want it to: when df has boolean-valued column labels:

In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df
Out[229]:
   False  True
0      3     1
1      4     2
2      5     3

You might want to use df[[True]] to select the True column. Instead it raises a ValueError:

In [230]: df[[True]]
ValueError: Item wrong length 1 instead of 3.

Versus using loc:

In [231]: df.loc[[True]]
Out[231]:
   False  True
0      3     1

In contrast, the following does not raise ValueError even though the structure of df2 is almost the same as df above:

In [258]: df2 = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]}); df2
Out[258]:
   A  B
0  1  3
1  2  4
2  3  5

In [259]: df2[['B']]
Out[259]:
   B
0  3
1  4
2  5

Thus, df[boolean_mask] does not always behave the same as df.loc[boolean_mask]. Even though this is arguably an unlikely use case, I would recommend always using df.loc[boolean_mask] instead of df[boolean_mask] because the meaning of df.loc's syntax is explicit. With df.loc[indexer] you know automatically that df.loc is selecting rows. In contrast, it is not clear if df[indexer] will select rows or columns (or raise ValueError) without knowing details about indexer and df.

df.loc[row_indexer, column_index] can select rows and columns. df[indexer] can only select rows or columns, depending on the type of values in indexer and the type of column values df has (again, are they boolean?).

In [237]: df2.loc[[True,False,True], 'B']
Out[237]:
0    3
2    5
Name: B, dtype: int64

When a slice is passed to df.loc, the end-points are included in the range. When a slice is passed to df[...], the slice is interpreted as a half-open interval:

In [239]: df2.loc[1:2]
Out[239]:
   A  B
1  2  4
2  3  5

In [271]: df2[1:2]
Out[271]:
   A  B
1  2  4
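The last two points can be reproduced as a standalone script rather than an IPython session. This is a minimal sketch that reuses df2 from above; the variable names (mask, rows_and_col, inclusive, half_open) are only illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]})

# With a full-length boolean mask, both spellings select the same rows.
mask = df2["A"] > 1
same_rows = df2[mask].equals(df2.loc[mask])

# loc can pick rows and a column in one call.
rows_and_col = df2.loc[[True, False, True], "B"]

# loc slices by label and includes both end-points.
inclusive = df2.loc[1:2]

# Plain [] slices by position and excludes the stop.
half_open = df2[1:2]

print(same_rows)
print(list(rows_and_col.index), list(rows_and_col))
print(list(inclusive.index), list(half_open.index))
```

Here df2.loc[1:2] returns the rows labeled 1 and 2, while df2[1:2] returns only the row at position 1, so swapping one for the other silently changes the result.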
Source: https://stackoverflow.com/questions/38886080/python-pandas-series-why-use-loc