I have a DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({\'foo.aa\': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
\'foo.fighters
Just perform a list comprehension to create your columns:
In [28]:
filter_col = [col for col in df if col.startswith('foo')]
filter_col
Out[28]:
['foo.aa', 'foo.bars', 'foo.fighters', 'foo.fox', 'foo.manchu']
In [29]:
df[filter_col]
Out[29]:
foo.aa foo.bars foo.fighters foo.fox foo.manchu
0 1.0 0 0 2 NA
1 2.1 0 1 4 0
2 NaN 0 NaN 1 0
3 4.7 0 0 0 0
4 5.6 0 0 0 0
5 6.8 1 0 5 0
Another method is to create a series from the columns and use the vectorised str method startswith:
In [33]:
df[df.columns[pd.Series(df.columns).str.startswith('foo')]]
Out[33]:
foo.aa foo.bars foo.fighters foo.fox foo.manchu
0 1.0 0 0 2 NA
1 2.1 0 1 4 0
2 NaN 0 NaN 1 0
3 4.7 0 0 0 0
4 5.6 0 0 0 0
5 6.8 1 0 5 0
In order to achieve what you want you need to add the following to filter the values that don't meet your ==1 criteria:
In [36]:
df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]]==1]
Out[36]:
bar.baz foo.aa foo.bars foo.fighters foo.fox foo.manchu nas.foo
0 NaN 1 NaN NaN NaN NaN NaN
1 NaN NaN NaN 1 NaN NaN NaN
2 NaN NaN NaN NaN 1 NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN 1 NaN NaN NaN NaN
EDIT
OK after seeing what you want the convoluted answer is this:
In [72]:
df.loc[df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]] == 1].dropna(how='all', axis=0).index]
Out[72]:
bar.baz foo.aa foo.bars foo.fighters foo.fox foo.manchu nas.foo
0 5.0 1.0 0 0 2 NA NA
1 5.0 2.1 0 1 4 0 0
2 6.0 NaN 0 NaN 1 0 1
5 6.8 6.8 1 0 5 0 0
Another option for the selection of the desired entries is to use map:
df.loc[(df == 1).any(axis=1), df.columns.map(lambda x: x.startswith('foo'))]
which gives you all the columns for rows that contain a 1:
foo.aa foo.bars foo.fighters foo.fox foo.manchu
0 1.0 0 0 2 NA
1 2.1 0 1 4 0
2 NaN 0 NaN 1 0
5 6.8 1 0 5 0
The row selection is done by
(df == 1).any(axis=1)
as in @ajcr's answer which gives you:
0 True
1 True
2 True
3 False
4 False
5 True
dtype: bool
meaning that row 3 and 4 do not contain a 1 and won't be selected.
The selection of the columns is done using Boolean indexing like this:
df.columns.map(lambda x: x.startswith('foo'))
In the example above this returns
array([False, True, True, True, True, True, False], dtype=bool)
So, if a column does not start with foo, False is returned and the column is therefore not selected.
If you just want to return all rows that contain a 1 - as your desired output suggests - you can simply do
df.loc[(df == 1).any(axis=1)]
which returns
bar.baz foo.aa foo.bars foo.fighters foo.fox foo.manchu nas.foo
0 5.0 1.0 0 0 2 NA NA
1 5.0 2.1 0 1 4 0 0
2 6.0 NaN 0 NaN 1 0 1
5 6.8 6.8 1 0 5 0 0
Now that pandas' indexes support string operations, arguably the simplest and best way to select columns beginning with 'foo' is just:
df.loc[:, df.columns.str.startswith('foo')]
Alternatively, you can filter column (or row) labels with df.filter(). To specify a regular expression to match the names beginning with foo.:
>>> df.filter(regex=r'^foo\.', axis=1)
foo.aa foo.bars foo.fighters foo.fox foo.manchu
0 1.0 0 0 2 NA
1 2.1 0 1 4 0
2 NaN 0 NaN 1 0
3 4.7 0 0 0 0
4 5.6 0 0 0 0
5 6.8 1 0 5 0
To select only the required rows (containing a 1) and the columns, you can use loc, selecting the columns using filter (or any other method) and the rows using any:
>>> df.loc[(df == 1).any(axis=1), df.filter(regex=r'^foo\.', axis=1).columns]
foo.aa foo.bars foo.fighters foo.fox foo.manchu
0 1.0 0 0 2 NA
1 2.1 0 1 4 0
2 NaN 0 NaN 1 0
5 6.8 1 0 5 0
My solution. It may be slower on performance:
a = pd.concat(df[df[c] == 1] for c in df.columns if c.startswith('foo'))
a.sort_index()
bar.baz foo.aa foo.bars foo.fighters foo.fox foo.manchu nas.foo
0 5.0 1.0 0 0 2 NA NA
1 5.0 2.1 0 1 4 0 0
2 6.0 NaN 0 NaN 1 0 1
5 6.8 6.8 1 0 5 0 0
Based on @EdChum's answer, you can try the following solution:
df[df.columns[pd.Series(df.columns).str.contains("foo")]]
This will be really helpful in case not all the columns you want to select start with foo. This method selects all the columns that contain the substring foo and it could be placed in at any point of a column's name.
In essence, I replaced .startswith() with .contains().
You can try the regex here to filter out the columns starting with "foo"
df.filter(regex='^foo*')
If you need to have the string foo in your column then
df.filter(regex='foo*')
would be appropriate.
For the next step, you can use
df[df.filter(regex='^foo*').values==1]
to filter out the rows where one of the values of 'foo*' column is 1.