Official pandas documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
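1. Reindexing
reindex creates a new object with the data rearranged to conform to the new index. Any label in the new index that was not present in the original introduces a missing value (or the value given by fill_value).
reindex function arguments: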
index | New sequence to use as index. Can be an Index instance or any other sequence-like Python data structure. |
method | Interpolation (fill) method; see the method options table for choices. |
fill_value | Substitute value to use when introducing missing data by reindexing. |
limit | When forward- or backward-filling, maximum number of consecutive elements to fill. |
level | Match a simple Index on a level of a MultiIndex. |
copy | If True (the default), always copy the underlying data; if False, do not copy the data when the new index is equivalent to the old index. |
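obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)  # fill newly introduced labels with 0 instead of NaN
For an ordered index, the method option can be used to forward-fill or backward-fill values when reindexing:
reindex method (interpolation) options: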
ffill or pad | Fill (or carry) values forward |
bfill or backfill | Fill (or carry) values backward |
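The reindexing behaviour described above can be sketched as follows. The Series contents are made up for illustration and are not taken from the original notes.

```python
import numpy as np
import pandas as pd

# Hypothetical example data (not from the original notes).
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# 'e' is a new label, so it gets NaN ...
with_nan = obj.reindex(['a', 'b', 'c', 'd', 'e'])
# ... unless fill_value supplies a substitute.
with_zero = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

# Forward fill requires a monotonically ordered index.
colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
filled = colors.reindex(range(6), method='ffill')

print(with_nan)
print(with_zero)
print(filled)
```

With method='ffill', each new label takes the value of the closest preceding label that already existed.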
For a DataFrame, reindex can change the index, the columns, or both; when only a sequence is passed, the rows are reindexed.
On a DataFrame, interpolation (filling) applies only along the rows (axis 0).
In [66]: frame.reindex(['a','b','c','d'])
In [67]: states = ['Texas', 'Utah', 'California']
In [68]: frame.reindex(columns=states)
In [71]: frame.reindex(index=['a','b','c','d'],columns=states)
The book Python for Data Analysis does this with the label-indexing feature of ix, which is deprecated (and was removed in pandas 1.0):
In [87]: frame.ix[['a','b','c','d'],states]
W:\software\Python\Python35\Scripts\ipython:1: DeprecationWarning: .ix is deprecated. Please use
FutureWarning: Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative.
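As the warning suggests, reindex can stand in for the deprecated ix call when some labels are missing. The following is a minimal sketch; the contents of frame are an assumption, since the original notes do not show how it was built.

```python
import numpy as np
import pandas as pd

# Assumed contents for `frame`; the notes do not show its definition.
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# Reindex rows and columns in one call. Row 'b' and column 'Utah'
# do not exist in `frame`, so they come back filled with NaN.
result = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
print(result)
```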
2. Dropping entries from an axis
In [96]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
In [97]: new_obj = obj.drop('c')
In [98]: obj.drop(['d', 'c'])
For a DataFrame, index values can be dropped from either axis:
In [101]: data.drop(['Colorado', 'Ohio'])
In [102]: data.drop('two', axis=1)
In [103]: data.drop(['two', 'four'], axis=1)
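A self-contained version of the drop calls above might look like this. It assumes data is the 4x4 frame that these notes define in the indexing section.

```python
import numpy as np
import pandas as pd

# Assumed to be the same `data` frame used in the indexing section.
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

rows_dropped = data.drop(['Colorado', 'Ohio'])     # axis=0 is the default
cols_dropped = data.drop(['two', 'four'], axis=1)  # drop column labels

# drop returns a new object; `data` itself is unchanged.
print(rows_dropped)
print(cols_dropped)
```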
3. Indexing, selection, and filtering
Series indexing works much like NumPy array indexing:
In [102]: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
In [103]: obj['b']
In [104]: obj[1]
In [105]: obj[2:4]
In [106]: obj[['b', 'a', 'd']]
In [107]: obj[[1, 3]]
In [108]: obj[obj < 2]
Slicing and assigning with labels (the endpoint is inclusive):
In [110]: obj['b':'c'] = 5
Indexing a DataFrame selects one or more columns:
In [112]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
   .....:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
   .....:                     columns=['one', 'two', 'three', 'four'])
In [115]: data[['three', 'one']]
Selecting rows by slicing or with a boolean array:
In [116]: data[:2]
In [117]: data[data['three'] > 5]
Indexing with a boolean DataFrame:
In [118]: data < 5
In [119]: data[data < 5] = 0
In [120]: data
Indexing rows and columns with ix (deprecated and removed in pandas 1.0; use other indexers such as loc and iloc instead):
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
In [121]: data.ix['Colorado', ['two', 'three']]
In [122]: data.ix[['Colorado', 'Utah'], [3, 0, 1]]
In [123]: data.ix[2]
In [124]: data.ix[:'Utah', 'two']
In [125]: data.ix[data.three > 5, :3]
DataFrame indexing options:
obj.ix[val] | Selects a single row or a subset of rows from the DataFrame. |
obj.ix[:, val] | Selects a single column or a subset of columns. |
obj.ix[val1, val2] | Selects both rows and columns. |
reindex method | Conform one or more axes to new indexes. |
xs method | Select a single row or column as a Series by label. |
icol, irow methods | Select a single column or row, respectively, as a Series by integer location. |
get_value, set_value methods | Select a single value by row and column label. |
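Since ix is no longer available in current pandas, the ix calls shown above can be rewritten with loc (label-based) and iloc (position-based). This is one possible translation, not the only one; mixed label/position selections are written as a loc step followed by an iloc step.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# data.ix['Colorado', ['two', 'three']]
a = data.loc['Colorado', ['two', 'three']]
# data.ix[['Colorado', 'Utah'], [3, 0, 1]]: labels for rows, positions for columns
b = data.loc[['Colorado', 'Utah']].iloc[:, [3, 0, 1]]
# data.ix[2]: third row by position
c = data.iloc[2]
# data.ix[:'Utah', 'two']: label slices include the endpoint
d = data.loc[:'Utah', 'two']
# data.ix[data.three > 5, :3]: boolean row mask, first three columns
e = data.loc[data['three'] > 5].iloc[:, :3]

for result in (a, b, c, d, e):
    print(result, end='\n\n')
```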
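4. Arithmetic operations
When objects are added together, the index of the result is the union of the objects' indexes. Labels that do not overlap are filled with NA (a fill value can be specified instead):
In [126]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
In [127]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
In [128]: s1 + s2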
df1 + df2
In [132]: df1.add(df2,fill_value=0)
add | Method for addition (+) |
sub | Method for subtraction (-) |
div | Method for division (/) |
mul | Method for multiplication (*) |
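The difference between the + operator and the add method with fill_value can be seen with the two Series defined above. With fill_value=0, a label that is missing from one operand is treated as 0 instead of producing NaN.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# Index of the result is the union; non-overlapping labels become NaN.
total = s1 + s2
# Same alignment, but a missing operand is replaced by 0 first.
filled = s1.add(s2, fill_value=0)

print(total)
print(filled)
```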
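Operations between DataFrame and Series
Operations between the two rely on broadcasting, which will be covered in more detail later. Broadcasting in one and two dimensions is fairly easy to understand.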
In [155]: frame - series
To match on the rows and broadcast across the columns instead, you must use one of the arithmetic methods:
In [162]: frame.sub(series1,axis=0)
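The two broadcasting directions can be compared side by side. The values in frame below are an assumption made for illustration, since the notes do not show how frame, series, and series1 were defined.

```python
import numpy as np
import pandas as pd

# Assumed example data; not shown in the original notes.
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Default: the Series index is matched against the columns
# and the operation is broadcast down the rows.
series = frame.iloc[0]
by_row = frame - series

# axis=0: the Series index is matched against the row labels
# and the operation is broadcast across the columns.
series1 = frame['d']
by_col = frame.sub(series1, axis=0)

print(by_row)
print(by_col)
```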
5. Function application and mapping
In [168]: np.abs(frame)
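The apply method applies a function to the one-dimensional arrays formed by each column or each row:
The function passed to apply can also return a Series made up of several values:
Element-wise Python functions can be used as well, through the applymap method: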
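The three uses described above can be sketched as follows. The data and the helper names spread, min_max, and fmt are illustrative assumptions. Recent pandas releases rename applymap to DataFrame.map, so the sketch picks whichever of the two the installed version provides.

```python
import pandas as pd

# Hypothetical example data.
frame = pd.DataFrame([[1.5, -2.0, 0.5],
                      [-0.5, 3.0, 1.0],
                      [2.5, 0.0, -1.5]],
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas'])

def spread(x):
    """Range (max - min) of a one-dimensional array."""
    return x.max() - x.min()

col_range = frame.apply(spread)          # once per column
row_range = frame.apply(spread, axis=1)  # once per row

def min_max(x):
    """Return several values as a Series; apply stacks them into a DataFrame."""
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

summary = frame.apply(min_max)

def fmt(value):
    """Element-wise formatting function."""
    return '%.2f' % value

# applymap was renamed to DataFrame.map in newer pandas; use what exists.
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(fmt)

print(col_range, row_range, summary, formatted, sep='\n\n')
```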
6. Sorting and ranking
To sort by row or column index, use the sort_index method.
To sort a Series by its values, use the sort_values method; missing values are placed at the end.
In [189]: obj.sort_values()  # obj.sort_values(ascending=False) sorts in descending order
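A DataFrame can be sorted by the index on either axis, in ascending order by default or in descending order:
In [197]: frame.sort_index()
In [198]: frame.sort_index(axis=1)
In [199]: frame.sort_index(axis=1, ascending=False)
On a DataFrame, the by keyword sorts by the values in one or more columns:
In [203]: frame.sort_values(by='b')  # frame.sort_index(by='b') raises FutureWarning: by argument to sort_index is deprecated, please use .sort_values(by=...)
Note: when sorting a DataFrame by its values, by is required and must name one or more columns (or rows); omitting it raises TypeError: sort_values() missing 1 required positional argument: 'by'. When by is a list, the first entry is the primary sort key and each later entry only breaks ties among entries that are equal on the earlier keys, so a later key can appear to have no effect. To sort by the values in a row, set axis=1; otherwise by is looked up among the column labels, the label is not found, and an error is raised. With axis=1, by refers to row (index) labels and the columns are reordered.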
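The points in the note can be checked with a small frame. The column values reuse the 'b' and 'a' data from the sorting example, with 4.3 simplified to 4 for illustration; the variable names are assumptions.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Single key: rows are ordered by column 'b'.
by_b = frame.sort_values(by='b')

# Two keys: 'a' is the primary key, 'b' only orders rows with equal 'a'.
by_a_then_b = frame.sort_values(by=['a', 'b'])

# axis=1: `by` names a row label (here row 0) and the columns are reordered.
by_row0 = frame.sort_values(by=0, axis=1)

print(by_b)
print(by_a_then_b)
print(by_row0)
```

Here row 0 holds b=4 and a=0, so sorting by that row places column 'a' before column 'b'.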
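Ranking:
Ranking is similar to sorting, but it assigns each value a rank, from 1 up to the number of valid data points in the array. It resembles the indirect sort indices produced by numpy.argsort, except that ties are broken according to a rule.
Ranks can also be assigned according to the order in which the values appear in the data:
In [216]: obj.rank(method='first')
# Ranking in descending order is also possible; method='max' gives every member of a tie group the group's maximum rank:
# obj.rank(ascending=False, method='max')
Ranks can also be computed along a given axis:
In [219]: frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
In [220]: frame
In [221]: frame.rank(axis=1)
Tie-breaking method options for rank: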
average | Default: assign the average rank to each entry in the equal group |
min | Use the minimum rank for the whole group |
max | Use the maximum rank for the whole group |
first | Assign ranks in the order the values appear in the data |
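The effect of each tie-breaking option in the table can be seen on a small Series that contains ties. The values below are an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical data with two tie groups (7 appears twice, 4 appears twice).
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                # default: ties share the mean rank
first = obj.rank(method='first')    # ties ranked in order of appearance
lowest = obj.rank(method='min')     # ties all get the group's lowest rank
desc_max = obj.rank(ascending=False, method='max')  # descending, highest rank per group

print(pd.DataFrame({'value': obj, 'average': average, 'first': first,
                    'min': lowest, 'desc_max': desc_max}))
```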
7. Summarizing and computing descriptive statistics
In [224]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
In [225]: df
In [227]: df.sum()
In [228]: df.sum(axis=1)
In [229]: df.sum(axis=1, skipna=False)
Options for reduction methods:
axis | Axis to reduce over. 0 for DataFrame’s rows and 1 for columns. |
skipna | Exclude missing values; True by default. |
level | Reduce grouped by level if the axis is hierarchically-indexed (MultiIndex). |
count | Number of non-NA values |
describe | Compute set of summary statistics for Series or each DataFrame column |
min, max | Compute minimum and maximum values |
argmin, argmax | Compute index locations (integers) at which minimum or maximum value obtained, respectively |
idxmin, idxmax | Compute index values at which minimum or maximum value obtained, respectively |
quantile | Compute sample quantile ranging from 0 to 1 |
sum | Sum of values |
mean | Mean of values |
median | Arithmetic median (50% quantile) of values |
mad | Mean absolute deviation from mean value |
var | Sample variance of values |
std | Sample standard deviation of values |
skew | Sample skewness (3rd moment) of values |
kurt | Sample kurtosis (4th moment) of values |
cumsum | Cumulative sum of values |
cummin, cummax | Cumulative minimum or maximum of values, respectively |
cumprod | Cumulative product of values |
diff | Compute 1st arithmetic difference (useful for time series) |
pct_change | Compute percent changes |
8. Unique values, value counts, and membership
In [231]: obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
In [232]: uniques = obj.unique()  # unique values
In [233]: uniques
Out[233]: array(['c', 'a', 'd', 'b'], dtype=object)
In [234]: obj.value_counts()  # value counts
In [235]: mask = obj.isin(['b', 'c'])  # membership
In [236]: mask
isin | Compute boolean array indicating whether each Series value is contained in the passed sequence of values. |
unique | Compute array of unique values in a Series, returned in the order observed. |
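value_counts | Return a Series containing the unique values as its index and their frequencies as its values, ordered by count in descending order. |
Passing pandas.value_counts to a DataFrame's apply method counts the values in each column:
In [239]: data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
In [240]: data
In [241]: result = data.apply(pd.value_counts).fillna(0)
In [242]: result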
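The three methods can be combined in one short, runnable sketch. The Series is the one from the session above; in the DataFrame, only the Qu1 values come from the notes and the Qu2 column is an assumption. The per-column count is written with a lambda around Series.value_counts, which behaves like passing pd.value_counts but avoids the top-level function that newer pandas versions deprecate.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = obj.unique()         # unique values, in order of first appearance
counts = obj.value_counts()    # frequency of each value
mask = obj.isin(['b', 'c'])    # boolean membership mask
subset = obj[mask]             # keep only 'b' and 'c'

# 'Qu1' is from the notes; 'Qu2' is a hypothetical second column.
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3]})
# One value_counts per column; values absent from a column become 0.
result = data.apply(lambda col: col.value_counts()).fillna(0)

print(uniques, counts, subset, result, sep='\n\n')
```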
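9. Handling missing data
pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays. It is simply a sentinel that is easy to detect.
NA handling methods: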
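dropna | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. |
fillna | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'. |
isnull | Return like-type object containing boolean values indicating which values are missing / NA. |
notnull | Negation of isnull. |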
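Filtering out missing data: dropna
For a Series, dropna() simply returns a Series containing only the non-null data and their index values:
In [6]: from numpy import nan as NA
In [7]: data = pd.Series([1,NA,4,NA,5])
In [8]: data.dropna()  # boolean indexing achieves the same thing: data[data.notnull()]
For a DataFrame, dropna by default drops every row that contains a missing value. Passing how='all' drops only the rows that are entirely NA. To drop columns instead of rows, pass axis=1. The thresh argument keeps only the rows that have at least the given number of non-NA values.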
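The DataFrame options described above can be compared on a small frame. The data below is an assumption chosen so that each option removes a different set of rows or columns.

```python
import pandas as pd
from numpy import nan as NA

# Hypothetical data: row 0 is complete, row 2 is entirely NA.
data = pd.DataFrame([[1.0, 6.5, 3.0],
                     [1.0, NA, NA],
                     [NA, NA, NA],
                     [NA, 6.5, 3.0]])

any_na = data.dropna()            # drop rows with at least one NA
all_na = data.dropna(how='all')   # drop rows that are entirely NA
enough = data.dropna(thresh=2)    # keep rows with at least 2 non-NA values

# Add an all-NA column, then drop columns that are entirely NA.
with_empty_col = data.copy()
with_empty_col[3] = NA
cols = with_empty_col.dropna(axis=1, how='all')

print(any_na, all_na, enough, cols, sep='\n\n')
```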
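Filling in missing data: fillna can replace all missing values with any constant, and passing a dict fills the missing values in each column with that column's own value.
Filling data with the method argument:
In [27]: df.fillna(method='bfill')
In [28]: df.fillna(method='bfill',limit=2)
fillna arguments: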
value | Scalar value or dict-like object to use to fill missing values |
method | Interpolation (fill) method, such as 'ffill' or 'bfill' |
axis | Axis to fill on; default axis=0 |
inplace | Modify the calling object without producing a copy |
limit | For forward and backward filling, maximum number of consecutive periods to fill |
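The filling options can be sketched on a small frame. The data and column names are assumptions. Because newer pandas versions deprecate fillna(method=...), the backward fill is written with the equivalent bfill method, which accepts the same limit argument.

```python
import pandas as pd
from numpy import nan as NA

# Hypothetical data with gaps of different lengths.
df = pd.DataFrame([[1.0, NA, 2.0],
                   [NA, NA, 3.0],
                   [NA, 5.0, NA],
                   [4.0, NA, NA]],
                  columns=['x', 'y', 'z'])

const = df.fillna(0)                       # one constant everywhere
per_col = df.fillna({'x': 0.5, 'y': -1})   # a different value per column; 'z' untouched

back = df.bfill()                  # same idea as fillna(method='bfill')
back_limited = df.bfill(limit=1)   # fill at most 1 consecutive NA per gap

print(const, per_col, back, back_limited, sep='\n\n')
```

Backward filling cannot fill a trailing NA, because there is no later value to copy from.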
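10. Hierarchical indexing
Hierarchical indexing allows multiple index levels on a single axis, which makes it possible to work with higher-dimensional data in a lower-dimensional form.
Creating a Series with a hierarchical index:
The index of such a Series is a MultiIndex.
Indexing:
In [34]: data['a']  # selection on the inner level and slice indexing are also supported
A hierarchically indexed Series can be rearranged into a DataFrame with the unstack method: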
In [38]: data.unstack()
In [42]: data.unstack().stack()  # stack is the inverse operation of unstack
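The following sketch builds a hierarchically indexed Series and runs the selections and the unstack/stack round trip described above. The data is an assumption, since the notes do not show how data was created. Selections use loc for clarity, and dropna is called after stack because pandas versions differ on whether stack keeps the NaN cells introduced by unstack.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level index: outer letters, inner integers.
data = pd.Series(np.arange(10.),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

outer = data['a']             # partial indexing on the outer level
sliced = data.loc['b':'c']    # label slice on the outer level (inclusive)
inner = data.loc[:, 2]        # select inner level == 2 across all outer labels

wide = data.unstack()         # inner level becomes the columns; gaps become NaN
round_trip = wide.stack().dropna()  # back to a Series, without the NaN gaps

print(data.index)
print(wide)
print(round_trip)
```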
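On a DataFrame, either axis can have a hierarchical index, and each index level can also be given a name:
swaplevel(): interchanges the order of the levels on an axis; sort_index(): sorts the data by the values in a given level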
sortlevel(1)
W:\software\Python\Python35\Scripts\ipython:1:
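A minimal sketch of named levels, swaplevel, and level-wise sorting follows. The frame and its level names are assumptions. sort_index(level=1) is used as the current way to sort by one level, in place of the older sortlevel(1) call.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a two-level index on both axes.
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Interchange the two row levels; the data itself is not reordered.
swapped = frame.swaplevel('key1', 'key2')

# Sort the rows by the values in the second index level.
by_key2 = frame.sort_index(level=1)

print(swapped)
print(by_key2)
```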
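Sometimes you want to use one or more of a DataFrame's columns as its index, which set_index does:
By default the columns used as the index are removed from the DataFrame; pass drop=False to keep them:
reset_index does the opposite of set_index: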
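The relationship between set_index and reset_index can be sketched as below. The frame contents are an assumption made for illustration.

```python
import pandas as pd

# Hypothetical example data.
frame = pd.DataFrame({'a': range(7),
                      'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

# Columns 'c' and 'd' become a two-level index and leave the columns.
indexed = frame.set_index(['c', 'd'])

# drop=False keeps them as ordinary columns as well.
kept = frame.set_index(['c', 'd'], drop=False)

# reset_index moves the index levels back into the columns.
restored = indexed.reset_index()

print(indexed)
print(kept)
print(restored)
```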