Basic pandas functionality | 易学教程

Official pandas documentation:

http://pandas.pydata.org/pandas-docs/stable/indexing.html

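1. Reindexing

reindex creates a new object whose data is rearranged to conform to the new index. Labels that were not present in the original index introduce missing values (or the value given by fill_value).

Arguments of reindex: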

index	New sequence to use as index. Can be an Index instance or any other sequence-like Python data structure.
method	Interpolation (fill) method; see the table of method options for the choices.
fill_value	Substitute value to use when introducing missing data by reindexing.
limit	When forward- or backfilling, the maximum size of gap to fill.
level	Match a simple Index on a level of a MultiIndex.
copy	If True (the default), always copy the underlying data; if False, do not copy when the new index is equivalent to the old index.

obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)   # new labels are filled with 0 instead of NaN
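The behaviour of reindex and fill_value can be sketched as follows. The Series values are made up for illustration and are not the article's original data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the index is deliberately out of order.
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# reindex returns a new object; 'e' is not in the original index, so it becomes NaN.
with_nan = obj.reindex(['a', 'b', 'c', 'd', 'e'])

# fill_value replaces the NaN that would otherwise be introduced.
with_zero = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

print(with_nan)
print(with_zero)
```

The original obj is left unchanged; both calls return new Series.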

For an ordered index, the method option can fill values forward or backward while reindexing:

method (interpolation) options for reindex:

ffill or pad	Fill (or carry) values forward
bfill or backfill	Fill (or carry) values backward
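A small sketch of the two fill directions, assuming a made-up Series of colour names with gaps in an integer index:

```python
import pandas as pd

# Hypothetical data with gaps at positions 1, 3 and 5.
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# ffill carries the last seen value forward into the new labels.
forward = obj3.reindex(range(6), method='ffill')

# bfill pulls the next value backward; label 5 has nothing after it and stays missing.
backward = obj3.reindex(range(6), method='bfill')

print(forward)
print(backward)
```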

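For a DataFrame, reindex can change the index (rows), the columns, or both. When only a sequence is passed, the rows are reindexed.

In the pandas version used for this article, interpolation (the method option) on a DataFrame can only be applied row-wise (axis 0).

In [66]: frame.reindex(['a','b','c','d'])
In [67]: states = ['Texas', 'Utah', 'California']
In [68]: frame.reindex(columns=states)
In [71]: frame.reindex(index=['a','b','c','d'],columns=states)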

The book Python for Data Analysis does this with the label-indexing feature of ix, which is deprecated and was later removed from pandas:

In [87]: frame.ix[['a','b','c','d'],states]
W:\software\Python\Python35\Scripts\ipython:1: DeprecationWarning: .ix is deprecated. Please use
FutureWarning: Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative.
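As the warning suggests, reindex is the replacement when some requested labels may be missing, while loc works when every label exists. A minimal sketch, assuming a small made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; row 'b' and column 'Utah' do not exist in it.
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# reindex tolerates missing labels and introduces them as NaN.
result = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)

# loc is appropriate when all requested labels are present.
subset = frame.loc[['a', 'c'], ['Texas', 'California']]

print(result)
print(subset)
```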

2. Dropping entries from an axis

In [96]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
In [97]: new_obj = obj.drop('c')
In [98]: obj.drop(['d', 'c'])

For a DataFrame, index values can be dropped from either axis:

In [101]: data.drop(['Colorado', 'Ohio'])
In [102]: data.drop('two', axis=1)
In [103]: data.drop(['two', 'four'], axis=1)
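The data frame used in these calls is only defined later in the article. A self-contained sketch, assuming that same 4x4 frame:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# axis=0 is the default, so these labels are looked up in the row index.
rows_dropped = data.drop(['Colorado', 'Ohio'])

# axis=1 looks the labels up among the columns instead.
cols_dropped = data.drop(['two', 'four'], axis=1)

print(rows_dropped)
print(cols_dropped)
```

drop returns a new object, so data itself keeps all four rows and columns.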

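3. Indexing, selection, and filtering

Series indexing works much like NumPy array indexing:

In [102]: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
In [103]: obj['b']
In [104]: obj[1]          # positional; newer pandas prefers obj.iloc[1]
In [105]: obj[2:4]
In [106]: obj[['b', 'a', 'd']]
In [107]: obj[[1, 3]]     # positional; newer pandas prefers obj.iloc[[1, 3]]
In [108]: obj[obj < 2]

Slicing with labels includes the endpoint, and it also works for assignment: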

In [110]: obj['b':'c'] = 5
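To make the endpoint difference concrete, here is a small sketch that compares a label slice with a positional slice, using the explicit loc and iloc accessors:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slice: the end label 'c' is included.
label_values = obj.loc['b':'c'].tolist()

# Positional slice: position 3 is excluded, as with Python lists.
position_values = obj.iloc[1:3].tolist()

# Assignment through a label slice changes both 'b' and 'c'.
obj.loc['b':'c'] = 5

print(label_values, position_values)
print(obj)
```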

Indexing into a DataFrame retrieves one or more columns:

In [112]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
   .....:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
   .....:                     columns=['one', 'two', 'three', 'four'])
In [115]: data[['three', 'one']]

Rows can be selected by slicing or with a boolean array:

In [116]: data[:2]
In [117]: data[data['three'] > 5]

Indexing with a boolean DataFrame:

In [118]: data < 5
In [119]: data[data < 5] = 0
In [120]: data

Selecting rows and columns with ix (deprecated; use loc or iloc instead):

http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated

In [121]: data.ix['Colorado', ['two', 'three']]
In [122]: data.ix[['Colorado', 'Utah'], [3, 0, 1]]
In [123]: data.ix[2]
In [124]: data.ix[:'Utah', 'two']
In [125]: data.ix[data.three > 5, :3]

DataFrame indexing options:

obj.ix[val]	Selects a single row or a subset of rows from the DataFrame.
obj.ix[:, val]	Selects a single column or a subset of columns.
obj.ix[val1, val2]	Selects both rows and columns.
reindex method	Conform one or more axes to new indexes.
xs method	Select a single row or column as a Series by label.
icol, irow methods	Select a single column or row, respectively, as a Series by integer location.
get_value, set_value methods	Select a single value by row and column label.
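Because ix mixed label and position lookups, each ix call has to be rewritten with loc (labels), iloc (positions), or a chain of the two. The following sketch shows one possible translation of the five ix calls above; the chaining is a choice made here, not the only option.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.loc['Colorado', ['two', 'three']]           # labels on both axes
mixed = data.loc[['Colorado', 'Utah']].iloc[:, [3, 0, 1]]   # row labels, column positions
third_row = data.iloc[2]                                    # third row by position
upto_utah = data.loc[:'Utah', 'two']                        # label slice includes 'Utah'
filtered = data.loc[data.three > 5].iloc[:, :3]             # boolean rows, first three columns

print(mixed)
print(filtered)
```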

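4. Arithmetic operations

When objects are added, the index of the result is the union of the two indexes. Labels that do not overlap get NA (a fill value can be specified with the arithmetic methods):

In [126]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
In [127]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
In [128]: s1 + s2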

df1 + df2

In [132]: df1.add(df2,fill_value=0)

add	Method for addition (+)
sub	Method for subtraction (-)
div	Method for division (/)
mul	Method for multiplication (*)
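The difference between the + operator and the add method with fill_value can be seen on the two Series defined above. A short sketch:

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# Labels present in only one operand ('d', 'f', 'g') become NaN.
plain = s1 + s2

# With fill_value=0 the missing side is treated as 0 before adding.
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```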

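Operations between DataFrame and Series

Arithmetic between the two relies on broadcasting, which will be covered in a later article. Broadcasting in one and two dimensions is fairly easy to follow: by default the index of the Series is matched against the columns of the DataFrame, and the operation is broadcast down the rows.

In [155]: frame - series

To match on the rows and broadcast across the columns instead, you must use one of the arithmetic methods and pass axis=0: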

In [162]: frame.sub(series1,axis=0)
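The two broadcasting directions can be sketched as follows. The frame and the two Series are assumptions made for this example, since the article does not show how frame, series and series1 were built.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]    # first row, indexed by column label
series1 = frame['d']      # one column, indexed by row label

# Default: match the Series index on the columns, broadcast down the rows.
by_columns = frame - series

# axis=0: match the Series index on the rows, broadcast across the columns.
by_rows = frame.sub(series1, axis=0)

print(by_columns)
print(by_rows)
```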

5. Function application and mapping

In [168]: np.abs(frame)

The apply method can also be used to apply a function to the one-dimensional arrays formed by each column or each row:
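A minimal sketch of apply along each axis. The frame values and the helper name spread are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, -2.0, 3.0], 'd': [4.0, 0.5, -1.5]},
                     index=['Utah', 'Ohio', 'Texas'])

def spread(x):
    """Difference between the largest and smallest value of a 1-D Series."""
    return x.max() - x.min()

per_column = frame.apply(spread)          # called once per column
per_row = frame.apply(spread, axis=1)     # called once per row

print(per_column)
print(per_row)
```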

The function passed to apply may also return a Series made up of several values:
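When the function returns a Series, apply assembles the per-column results into a DataFrame. A sketch, with a hypothetical helper min_max and made-up data:

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, -2.0, 3.0], 'd': [4.0, 0.5, -1.5]},
                     index=['Utah', 'Ohio', 'Texas'])

def min_max(x):
    """Return two statistics per column, labelled 'min' and 'max'."""
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

# The returned labels become the row index of the result.
summary = frame.apply(min_max)

print(summary)
```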

Element-wise Python functions can be used as well, through the applymap method (renamed DataFrame.map in pandas 2.1):
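The sketch below formats every element as a string with two decimals. To stay independent of the pandas version it uses Series.map on each column instead of calling applymap directly; that substitution is a choice made here, and the data is made up.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, -2.0], 'd': [4.0, 0.5]},
                     index=['Utah', 'Ohio'])

def two_decimals(x):
    """Format a single float with two decimal places."""
    return '{:.2f}'.format(x)

# Series.map applies the function to each element of one column;
# apply runs that once per column, covering the whole frame.
formatted = frame.apply(lambda col: col.map(two_decimals))

print(formatted)
```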

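6. Sorting and ranking

Use the sort_index method to sort by row or column index.

To sort a Series by its values, use the sort_values method; any missing values are placed at the end by default.

In [189]: obj.sort_values()   # obj.sort_values(ascending=False) sorts in descending order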

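For a DataFrame you can sort by the index on either axis. Ascending order is the default, and descending order is also available:

In [197]: frame.sort_index()
In [198]: frame.sort_index(axis=1)
In [199]: frame.sort_index(axis=1, ascending=False)

On a DataFrame you can also use the by keyword to sort by the values in one or more columns:

In [203]: frame.sort_values(by='b')    # sort_index(by='b') gives FutureWarning: by argument to sort_index is deprecated, please use .sort_values(by=...)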

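Note: when sorting a DataFrame by value, by must name one or more columns (or rows). Calling sort_values without it raises TypeError: sort_values() missing 1 required positional argument: 'by'. When by is a list, the first entry is the primary sort key, and each later entry only breaks ties among rows that are equal on the earlier keys, so a later key may appear to have no effect. To sort by the values in a row, pass axis=1 and set by to a row label; with the default axis=0, by is looked up among the column labels, and a label that is not found there raises an error.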
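The points in the note can be checked with a small sketch. The frame below is an assumption made for the example, with a default integer row index.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Single key: rows ordered by column 'b'.
by_b = frame.sort_values(by='b')

# Two keys: 'a' is the primary key, 'b' only orders rows with equal 'a'.
by_a_then_b = frame.sort_values(by=['a', 'b'])

# axis=1: 'by' is now a row label, and the columns are reordered
# according to the values found in the row labelled 0.
by_row = frame.sort_values(by=0, axis=1)

print(by_a_then_b)
print(by_row)
```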

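Ranking:

Ranking is closely related to sorting. It assigns ranks from 1 up to the number of valid data points in the array. It resembles the indirect sort indices produced by numpy.argsort, except that ties are broken according to a rule.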


Ties can instead be broken according to the order in which the values appear in the data:

In [216]: obj.rank(method='first')
# Descending ranks are also possible: obj.rank(ascending=False, method='max') gives each tied group its maximum rank

Ranks can also be computed along a given axis:

In [219]: frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
In [220]: frame
In [221]: frame.rank(axis=1)

method options for breaking ties in rank:

average	Default: assign the average rank to each entry in the equal group
min	Use the minimum rank for the whole group
max	Use the maximum rank for the whole group
first	Assign ranks in the order the values appear in the data
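The tie-breaking rules are easiest to compare on one Series with repeated values. A sketch with made-up data:

```python
import pandas as pd

# 7 and 4 each appear twice, so they form two tied groups.
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                                 # default: mean rank of each tied group
first = obj.rank(method='first')                     # ties ranked by order of appearance
desc_max = obj.rank(ascending=False, method='max')   # descending, highest rank of the group

print(average.tolist())
print(first.tolist())
print(desc_max.tolist())
```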

7. Summarizing and computing descriptive statistics

In [224]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
In [225]: df
In [227]: df.sum()
In [228]: df.sum(axis=1)
In [229]: df.sum(axis=1,skipna=False)

Options for reduction methods

axis	Axis to reduce over. 0 for DataFrame’s rows and 1 for columns.
skipna	Exclude missing values; True by default.
level	Reduce grouped by level if the axis is hierarchically-indexed (MultiIndex).
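The effect of axis and skipna can be sketched on a small frame with missing values. The numbers below are made up; they are not the article's truncated data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.5, np.nan], [7.0, -4.5], [np.nan, np.nan], [0.5, -1.5]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

col_sums = df.sum()                         # reduce over rows: one value per column
row_sums = df.sum(axis=1)                   # reduce over columns: one value per row
strict = df.sum(axis=1, skipna=False)       # any NaN in a row makes the result NaN

print(col_sums)
print(row_sums)
print(strict)
```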

Descriptive and summary statistics

count	Number of non-NA values
describe	Compute set of summary statistics for Series or each DataFrame column
min, max	Compute minimum and maximum values
argmin, argmax	Compute index locations (integers) at which minimum or maximum value obtained, respectively
idxmin, idxmax	Compute index values at which minimum or maximum value obtained, respectively
quantile	Compute sample quantile ranging from 0 to 1
sum	Sum of values
mean	Mean of values
median	Arithmetic median (50% quantile) of values
mad	Mean absolute deviation from mean value
var	Sample variance of values
std	Sample standard deviation of values
skew	Sample skewness (3rd moment) of values
kurt	Sample kurtosis (4th moment) of values
cumsum	Cumulative sum of values
cummin, cummax	Cumulative minimum or maximum of values, respectively
cumprod	Cumulative product of values
diff	Compute 1st arithmetic difference (useful for time series)
pct_change	Compute percent changes

8. Unique values, value counts, and membership

In [231]: obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
In [232]: uniques = obj.unique()      # unique values
In [233]: uniques
Out[233]: array(['c', 'a', 'd', 'b'], dtype=object)
In [234]: obj.value_counts()          # value counts
In [235]: mask = obj.isin(['b', 'c']) # membership
In [236]: mask

isin	Compute boolean array indicating whether each Series value is contained in the passed sequence of values.
unique	Compute array of unique values in a Series, returned in the order observed.
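value_counts	Return a Series containing unique values as its index and frequencies as its values, ordered by count in descending order.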

pandas.value_counts:

In [239]: data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
In [240]: data
In [241]: result = data.apply(pd.value_counts).fillna(0)
In [242]: result
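A self-contained sketch of counting values per column. The Qu2 and Qu3 columns are assumptions, since the original definition is cut off, and the Series.value_counts method is used in place of the top-level pd.value_counts, which newer pandas versions deprecate.

```python
import pandas as pd

# 'Qu1' is from the article; 'Qu2' and 'Qu3' are made up for illustration.
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Each column yields a Series of counts; apply aligns them on the union
# of observed values, and fillna(0) marks values a column never contains.
result = data.apply(lambda col: col.value_counts()).fillna(0)

print(result)
```

The row labels of result are the distinct values found anywhere in the frame, and each cell is how often that value occurs in the column.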

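9. Handling missing data

pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays. It is simply a sentinel that is easy to detect.

NA handling methods: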

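dropna	Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
fillna	Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
isnull	Return like-type object containing boolean values indicating which values are missing / NA.
notnull	Negation of isnull.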

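Filtering out missing data: dropna

For a Series, dropna() simply returns a Series containing only the non-null data and their index values:

In [6]: from numpy import nan as NA
In [7]: data = pd.Series([1,NA,4,NA,5])
In [8]: data.dropna()           # boolean indexing gives the same result: data[data.notnull()]

For a DataFrame, dropna by default drops every row that contains a missing value. Passing how='all' drops only the rows that are entirely NA. To drop columns instead of rows, pass axis=1. The thresh argument keeps only the rows that have at least that many non-NA values.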
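The dropna options described above can be sketched on a small made-up frame:

```python
import numpy as np
import pandas as pd

NA = np.nan
data = pd.DataFrame([[1., 6.5, 3.],
                     [1., NA, NA],
                     [NA, NA, NA],
                     [NA, 6.5, 3.]])

any_na = data.dropna()               # keeps only rows with no NA at all
all_na = data.dropna(how='all')      # drops only the all-NA row
at_least_two = data.dropna(thresh=2) # keeps rows with at least 2 non-NA values

# Add a hypothetical all-NA column, then drop it again along axis=1.
wider = data.assign(extra=NA)
cols = wider.dropna(axis=1, how='all')

print(any_na)
print(all_na)
print(at_least_two)
print(cols)
```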

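fillna fills in missing data: passing a scalar replaces every missing value with that number, and passing a dict fills the missing values of each column with that column's own value.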
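A minimal sketch of both ways of calling fillna, on a made-up two-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                   'y': [np.nan, 5.0, np.nan]})

# Scalar: every missing value becomes 0.
zeros = df.fillna(0)

# Dict: keys are column names, values are the per-column fill values.
per_column = df.fillna({'x': -1.0, 'y': 0.5})

print(zeros)
print(per_column)
```

Both calls return new objects; df still contains its missing values.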

Filling data with the method argument:

In [27]: df.fillna(method='bfill')
In [28]: df.fillna(method='bfill',limit=2)

fillna arguments:

value	Scalar value or dict-like object to use to fill missing values
method	Interpolation, by default 'ffill' if function called with no other arguments
axis	Axis to fill on, default axis=0
inplace	Modify the calling object without producing a copy
limit	For forward and backward filling, maximum number of consecutive periods to fill
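The effect of limit on backward filling can be sketched as follows. The example calls the bfill method directly, which does the same job as fillna(method='bfill') and avoids the deprecation warning that newer pandas versions emit for the method argument; the data is made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, np.nan, 4.0, np.nan, 6.0])

# Without a limit, every gap is filled from the next valid value.
back_all = s.bfill()

# limit=2 fills at most two consecutive missing values per gap,
# so the first of the three leading NaNs stays missing.
back_two = s.bfill(limit=2)

print(back_all.tolist())
print(back_two.tolist())
```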

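10. Hierarchical indexing

Hierarchical indexing lets you have multiple index levels on a single axis, so that higher-dimensional data can be handled in a lower-dimensional form.

Creating a Series with a hierarchical index: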
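A sketch of one way to build such a Series: pass a list of two equally long lists as the index. The values are made up (the article's original data is not shown).

```python
import numpy as np
import pandas as pd

# Outer level: letters; inner level: integers.
data = pd.Series(np.arange(10.),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

print(data)
print(type(data.index).__name__)
```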

The index of such a Series is a MultiIndex.

Indexing:

In [34]: data['a']
# indexing on the inner level is also possible
# so is indexing with a slice
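The three kinds of selection mentioned here (outer label, slice, inner level) can be sketched with loc. The Series is an assumption, built with made-up values:

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

outer = data.loc['b']          # all entries under outer label 'b'
sliced = data.loc['b':'c']     # slice of outer labels, endpoint included
inner = data.loc[:, 2]         # every outer label, inner label 2

print(outer)
print(sliced)
print(inner)
```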

A hierarchically indexed Series can be turned into a DataFrame with the unstack method:

In [38]: data.unstack()
In [42]: data.unstack().stack()        # stack is the inverse of unstack
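A sketch of the round trip, again on a made-up Series. Combinations that do not exist in the Series, such as ('c', 3), appear as NaN in the unstacked frame; the explicit dropna after stack is a precaution taken here so the result does not depend on how a given pandas version treats those padded cells.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

# Inner level becomes the columns; missing combinations are NaN.
wide = data.unstack()

# stack moves the columns back into an inner index level.
restored = wide.stack().dropna()

print(wide)
print(restored)
```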

On a DataFrame either axis can have a hierarchical index, and each index level can also have a name:
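A sketch of a frame with two levels on each axis. The labels and the level names (key1, key2, state, color) are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])

# Name the levels on both axes.
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Selecting an outer column label returns the group of columns under it.
ohio = frame['Ohio']

print(frame)
print(ohio)
```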

swaplevel() reorders the levels on an axis; sort_index() sorts the data by the values in a level.
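The two methods are often used together: swap the levels, then sort on the new outer level. A sketch with a made-up frame and hypothetical level names key1 and key2:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((4, 2)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=['x', 'y'])
frame.index.names = ['key1', 'key2']

# The levels trade places; the data itself is unchanged.
swapped = frame.swaplevel('key1', 'key2')

# Sort the rows by the values in the (new) outermost level.
ordered = swapped.sort_index(level=0)

print(swapped)
print(ordered)
```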

sortlevel(1) W:\software\Python\Python35\Scripts\ipython:1:

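Sometimes we want to use one or more columns of a DataFrame as its index (set_index):

By default the columns that become the index are removed from the DataFrame; pass drop=False to keep them: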
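Both behaviours of set_index can be sketched on a small made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7),
                      'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

# Columns 'c' and 'd' become the two index levels and leave the columns.
indexed = frame.set_index(['c', 'd'])

# drop=False keeps them as ordinary columns as well.
kept = frame.set_index(['c', 'd'], drop=False)

print(indexed)
print(kept)
```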

reset_index does the opposite of set_index: the index levels are moved back into columns.
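A short sketch of the round trip, using a made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({'a': range(4),
                      'c': ['one', 'one', 'two', 'two'],
                      'd': [0, 1, 0, 1]})

indexed = frame.set_index(['c', 'd'])

# The index levels come back as the leading columns,
# and the rows get a fresh default integer index.
restored = indexed.reset_index()

print(restored)
```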
