Question
My dataframe has subcategories: under each category (cat, dog, bird) there are rows of stats information. I need to remove the rows whose stats value is count or freq and keep only the rows with sd and mean. Some values are NaN.
My code raises a ValueError.
df:
var stats A B C
cat mean 2 3 4
NaN sd 2 1 3
NaN count 5 2 6
NaN freq 3 1 19
dog mean 8 1 2
NaN sd 2 1 3
NaN count 4 6 1
NaN freq 3 1 19
bird mean 2 3 4
NaN sd 2 1 3
NaN count 5 2 6
NaN freq NaN NaN NaN
My code:
rows = ['count', 'freq']
df = [df.stats != rows]
Expected outcome
var stats A B C
cat mean 2 3 4
NaN sd 2 1 3
dog mean 8 1 2
NaN sd 2 1 3
bird mean 2 3 4
NaN sd 2 1 3
error:
File "pandas/_libs/lib.pyx", line 805, in pandas._libs.lib.vec_compare
(pandas/_libs/lib.c:14288)
ValueError: Arrays were different lengths: 819 vs 9
I am not sure how to check the array lengths, but in my Excel spreadsheet all columns and rows have the same length. Is this error caused by NaN/empty cells in my data?
Thanks!
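Answer 1:
!= will not work here: it compares the column with the list element by element, so both sides must have the same length. Use pd.Series.isin to build a boolean mask, then use that mask to filter your dataframe.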
m = ~df.stats.isin(['count', 'freq'])
print(m)
0 True
1 True
2 False
3 False
4 True
5 True
6 False
7 False
8 True
9 True
10 False
11 False
Name: stats, dtype: bool
print(df[m])
var stats A B C
0 cat mean 2.0 3.0 4.0
1 NaN sd 2.0 1.0 3.0
4 dog mean 8.0 1.0 2.0
5 NaN sd 2.0 1.0 3.0
8 bird mean 2.0 3.0 4.0
9 NaN sd 2.0 1.0 3.0
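To make the difference concrete, here is a minimal sketch that rebuilds the sample frame from the question and runs both the failing comparison and the isin mask. The frame construction is my own reconstruction of the posted table, and the exact wording of the error message depends on the pandas version, so it may not match the traceback in the question. Since the comparison fails on length alone, the NaN cells do not appear to be the cause.

```python
import numpy as np
import pandas as pd

# Reconstruction of the sample frame from the question (NaN marks empty cells).
df = pd.DataFrame({
    "var": ["cat", np.nan, np.nan, np.nan,
            "dog", np.nan, np.nan, np.nan,
            "bird", np.nan, np.nan, np.nan],
    "stats": ["mean", "sd", "count", "freq"] * 3,
    "A": [2, 2, 5, 3, 8, 2, 4, 3, 2, 2, 5, np.nan],
    "B": [3, 1, 2, 1, 1, 1, 6, 1, 3, 1, 2, np.nan],
    "C": [4, 3, 6, 19, 2, 3, 1, 19, 4, 3, 6, np.nan],
})
rows = ["count", "freq"]

# Element-wise comparison: a 12-element Series against a 2-element list.
try:
    df.stats != rows
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# Membership test: one boolean per row, whatever the length of `rows`.
mask = ~df.stats.isin(rows)
filtered = df[mask]
print(filtered)
```

Note that the mask keeps the original index labels (0, 1, 4, 5, ...); chain .reset_index(drop=True) if you want them renumbered.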
Answer 2:
You can use the SQL-like query() method:
In [163]: df.query("stats not in ['count','freq']")
Out[163]:
var stats A B C
0 cat mean 2.0 3.0 4.0
1 NaN sd 2.0 1.0 3.0
4 dog mean 8.0 1.0 2.0
5 NaN sd 2.0 1.0 3.0
8 bird mean 2.0 3.0 4.0
9 NaN sd 2.0 1.0 3.0
Or, using your rows variable (the @ prefix refers to a Python variable from the calling scope):
In [164]: df.query("stats not in @rows")
Out[164]:
var stats A B C
0 cat mean 2.0 3.0 4.0
1 NaN sd 2.0 1.0 3.0
4 dog mean 8.0 1.0 2.0
5 NaN sd 2.0 1.0 3.0
8 bird mean 2.0 3.0 4.0
9 NaN sd 2.0 1.0 3.0
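As a quick sanity check, the sketch below runs the query() form next to an isin mask on a cut-down version of the frame (only the stats and A columns, which is my simplification) and compares the results. It assumes rows is defined in the same scope as the query() call.

```python
import pandas as pd

# Cut-down version of the question's frame: only the columns needed here.
df = pd.DataFrame({
    "stats": ["mean", "sd", "count", "freq"] * 3,
    "A": [2, 2, 5, 3, 8, 2, 4, 3, 2, 2, 5, None],
})
rows = ["count", "freq"]

# `@rows` is resolved from the local Python scope when the query is evaluated.
by_query = df.query("stats not in @rows")

# The same filter written as a boolean mask, for comparison.
by_mask = df[~df["stats"].isin(rows)]

print(by_query.equals(by_mask))
```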
Answer 3:
For fun!
rows = ['count', 'freq']
df.merge(pd.DataFrame(dict(stats=np.setdiff1d(df.stats, rows))))
var stats A B C
0 cat mean 2.0 3.0 4.0
1 dog mean 8.0 1.0 2.0
2 bird mean 2.0 3.0 4.0
3 NaN sd 2.0 1.0 3.0
4 NaN sd 2.0 1.0 3.0
5 NaN sd 2.0 1.0 3.0
Another interesting way, using the index and drop (note that stats becomes the first column):
df.set_index('stats').drop(rows).reset_index()
stats var A B C
0 mean cat 2.0 3.0 4.0
1 sd NaN 2.0 1.0 3.0
2 mean dog 8.0 1.0 2.0
3 sd NaN 2.0 1.0 3.0
4 mean bird 2.0 3.0 4.0
5 sd NaN 2.0 1.0 3.0
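One caveat with the drop approach: by default drop raises a KeyError if any label in the list is not present in the index. A minimal sketch, assuming a cut-down frame and a hypothetical extra label "median" that does not occur in the data; passing errors="ignore" skips labels that are missing.

```python
import pandas as pd

# Cut-down version of the question's frame: only the columns needed here.
df = pd.DataFrame({
    "stats": ["mean", "sd", "count", "freq"] * 3,
    "A": [2, 2, 5, 3, 8, 2, 4, 3, 2, 2, 5, None],
})
# "median" is a hypothetical label added for illustration; it is not in the data.
rows = ["count", "freq", "median"]

# Default behaviour: a label missing from the index raises KeyError.
try:
    df.set_index("stats").drop(rows)
    missing_label_raised = False
except KeyError:
    missing_label_raised = True

# errors="ignore" drops the labels that exist and skips the rest.
result = df.set_index("stats").drop(rows, errors="ignore").reset_index()
print(result)
```

The isin and query() approaches do not have this issue, since a value that never occurs simply matches no rows.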
Answer 4:
LOL :)
df[[x not in rows for x in df.stats]]
Out[520]:
var stats A B C
0 cat mean 2.0 3.0 4.0
1 NaN sd 2.0 1.0 3.0
4 dog mean 8.0 1.0 2.0
5 NaN sd 2.0 1.0 3.0
8 bird mean 2.0 3.0 4.0
9 NaN sd 2.0 1.0 3.0
Source: https://stackoverflow.com/questions/46655712/remove-rows-and-valueerror-arrays-were-different-lengths