Question
Given a toy dataset as follows:
   id    room   area           situation
0   1   A-102  world  under construction
1   2     NaN     24  under construction
2   3    B309    NaN                 NaN
3   4   C·102     25    under decoration
4   5  E_1089  hello    under decoration
5   6      27    NaN          under plan
6   7      27    NaN                 NaN
I need to check three columns, room, area and situation, based on the following conditions:
(1) if a room name contains anything other than digits, letters or - (NaNs are also considered invalid), then return incorrect room name in the check column;
(2) if area is neither a number nor NaN, then append area is not numbers to the existing check column;
(3) if situation contains under decoration, then append decoration is in the content to the existing check column.
Please note that I have other columns to check in the real data, and I need to append each new check result using the separator ; .
How could I get the expected result like this:
   id    room   area           situation  check
0   1   A-102  world  under construction  area is not numbers
1   2     NaN     24  under construction  incorrect room name
2   3    B309    NaN                 NaN  NaN
3   4   C·102     25    under decoration  incorrect room name; decoration is in the content
4   5  E_1089  hello    under decoration  incorrect room name; area is not numbers; decoration is in the content
5   6      27    NaN          under plan  NaN
6   7      27    NaN                 NaN  NaN
My code so far:
Room name check:
df['check'] = np.where(df.room.str.match('^[a-zA-Z\d\-]*$'), np.NaN, 'incorrect room name')
Out:
   id    room   area           situation  check
0   1   A-102  world  under construction  nan
1   2     NaN     24  under construction  nan
2   3    B309    NaN                 NaN  nan
3   4   C·102     25    under decoration  incorrect room name
4   5  E_1089  hello    under decoration  incorrect room name
5   6      27    NaN          under plan  nan
6   7      27    NaN                 NaN  nan
Area check:
df['check'] = df['check'].where(df.area.str.contains('^\d+$', na = True),
'area is not a numbers')
Out:
   id    room   area           situation  check
0   1   A-102  world  under construction  area is not a numbers
1   2     NaN     24  under construction  nan
2   3    B309    NaN                 NaN  nan
3   4   C·102     25    under decoration  incorrect room name
4   5  E_1089  hello    under decoration  area is not a numbers
5   6      27    NaN          under plan  nan
6   7      27    NaN                 NaN  nan
Situation check:
df['check'] = df['check'].where(df.situation.str.contains('under decoration', na = True),
'decoration is in the content')
Out:
   id    room   area           situation  check
0   1   A-102  world  under construction  decoration is in the content
1   2     NaN     24  under construction  decoration is in the content
2   3    B309    NaN                 NaN  nan
3   4   C·102     25    under decoration  incorrect room name
4   5  E_1089  hello    under decoration  area is not a numbers
5   6      27    NaN          under plan  decoration is in the content
6   7      27    NaN                 NaN  nan
Thanks.
Answer 1:
First, compute the result of each test with numpy.where, returning None where a row passes. Then zip the arrays and apply a custom function that joins the non-missing messages, or returns NaN if a row has none:
a = np.where(df.room.str.match(r'^[a-zA-Z\d\-]*$', na = False), None,
             'incorrect room name')
b = np.where(df.area.str.contains(r'^\d+$', na = True), None,
             'area is not a numbers')
c = np.where(df.situation.str.contains('under decoration', na = False),
             'decoration is in the content', None)
f = (lambda x: ';'.join(y for y in x if pd.notna(y))
     if any(pd.notna(np.array(x))) else np.nan)
df['check'] = [f(x) for x in zip(a,b,c)]
print(df)
   id    room   area           situation  \
0   1   A-102  world  under construction
1   2     NaN     24  under construction
2   3    B309    NaN                 NaN
3   4   C·102     25    under decoration
4   5  E_1089  hello    under decoration
5   6      27    NaN          under plan
6   7      27    NaN                 NaN

   check
0  area is not a numbers
1  incorrect room name
2  NaN
3  incorrect room name;decoration is in the content
4  incorrect room name;area is not a numbers;deco...
5  NaN
6  NaN
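A self-contained sketch of the same idea, in case it helps to run it end to end. It rebuilds the toy data from the question (assuming room and area are stored as strings), keeps each test as a boolean mask paired with its message, and joins with "; " and the message wording from the expected output. The names checks, masks, messages and join_messages are hypothetical, and the room pattern uses + rather than *, so an empty string would also count as an invalid room name; that is an assumption, not something the question states.

```python
import numpy as np
import pandas as pd

# Toy data from the question; room and area are assumed to be strings.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "room": ["A-102", np.nan, "B309", "C·102", "E_1089", "27", "27"],
    "area": ["world", "24", np.nan, "25", "hello", np.nan, np.nan],
    "situation": ["under construction", "under construction", np.nan,
                  "under decoration", "under decoration", "under plan", np.nan],
})

# Each check is a (mask, message) pair; True in the mask means the row is flagged.
# na=False makes a missing room invalid, na=True lets a missing area pass.
checks = [
    (~df["room"].str.fullmatch(r"[A-Za-z\d-]+", na=False), "incorrect room name"),
    (~df["area"].str.fullmatch(r"\d+", na=True), "area is not numbers"),
    (df["situation"].str.contains("under decoration", na=False),
     "decoration is in the content"),
]

# One row per record, one boolean column per check.
masks = np.column_stack([mask.to_numpy() for mask, _ in checks])
messages = [message for _, message in checks]

def join_messages(row_flags):
    """Join the messages of all flagged checks, or return NaN if none fired."""
    hits = [message for flag, message in zip(row_flags, messages) if flag]
    return "; ".join(hits) if hits else np.nan

df["check"] = [join_messages(row_flags) for row_flags in masks]
print(df)
```

Adding another column to check should then only need one more (mask, message) pair in checks.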
Answer 2:
I reworked your conditions a bit so the result comes closer to your expected output:
a = np.where(df.room.str.match('^[a-zA-Z\d\-]*$').notnull(), pd.NA, 'incorrect room name')
b = np.where(df["area"].str.isnumeric() & df["area"].notnull(), pd.NA, 'area is not a numbers')
c = np.where(df.situation.str.contains('under decoration', na = False), 'decoration is in the content', pd.NA)
s = (pd.concat([pd.Series(i, index=df.index) for i in (a, b, c)], axis = 1)
.stack().groupby(level = 0).agg("; ".join))
print(df.assign(check=s))
   id    room   area           situation  check
0   1   A-102  world  under construction  area is not a numbers
1   2     NaN     24  under construction  incorrect room name
2   3    B309    NaN                 NaN  area is not a numbers; decoration is in the co...
3   4   C·102     25    under decoration  decoration is in the content
4   5  E_1089  hello    under decoration  area is not a numbers; decoration is in the co...
5   6      27    NaN          under plan  area is not a numbers
6   7      27    NaN                 NaN  area is not a numbers; decoration is in the co...
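The room test above calls .notnull() on the result of str.match, which is True for every non-missing room, so only missing rooms get flagged; the area test also flags missing areas. Below is a hedged variant of the same concat/stack/groupby approach that follows the conditions in the question instead. It is only a sketch: the helper flag and the names bad_room, bad_area, decorating and result are hypothetical, the data is assumed to hold strings, and .dropna() is called explicitly after stack() because whether stack() drops missing values itself depends on the pandas version.

```python
import numpy as np
import pandas as pd

# Toy data from the question; room and area are assumed to be strings.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "room": ["A-102", np.nan, "B309", "C·102", "E_1089", "27", "27"],
    "area": ["world", "24", np.nan, "25", "hello", np.nan, np.nan],
    "situation": ["under construction", "under construction", np.nan,
                  "under decoration", "under decoration", "under plan", np.nan],
})

# Boolean masks: True means the row should receive the message.
bad_room = ~df["room"].str.fullmatch(r"[A-Za-z\d-]+", na=False)  # NaN room is invalid
bad_area = ~df["area"].str.fullmatch(r"\d+", na=True)            # NaN area is allowed
decorating = df["situation"].str.contains("under decoration", na=False)

def flag(mask, message):
    """Series holding `message` where mask is True and NaN elsewhere."""
    return pd.Series(message, index=df.index).where(mask)

# One column of messages per check, aligned on the original index.
messages = pd.concat(
    [
        flag(bad_room, "incorrect room name"),
        flag(bad_area, "area is not numbers"),
        flag(decorating, "decoration is in the content"),
    ],
    axis=1,
)

# Long format, drop the empty cells, then join what is left per original row.
check = messages.stack().dropna().groupby(level=0).agg("; ".join)

# Rows without any message are absent from `check` and become NaN on assignment.
result = df.assign(check=check)
print(result)
```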
Answer 3:
You can try this:
import os
import glob
import pandas as pd
os.chdir(r"C:\Users\Rameez PC\Desktop\python data files 2")
extension = 'xlsx'
all_filenames = glob.glob('*.{}'.format(extension))
# combine all files in the list
combined_xlsx1 = pd.concat([pd.read_excel(f) for f in all_filenames])
# export to Excel
combined_xlsx1.to_excel("combined.xlsx", index=False)
Source: https://stackoverflow.com/questions/64746705/check-multiple-columns-data-format-and-append-results-to-one-column-in-pandas