Question
Given a toy dataset as follows:
   id    room   area           situation
0   1   A-102  world  under construction
1   2     NaN     24  under construction
2   3    B309    NaN                 NaN
3   4   C·102     25    under decoration
4   5  E_1089  hello    under decoration
5   6      27    NaN          under plan
6   7      27    NaN                 NaN
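For reproducibility, the table above can be built as a DataFrame roughly like this. It is a minimal sketch that assumes room and area are stored as strings (the .str calls used further down rely on that) and that the missing cells are NaN:

```python
import numpy as np
import pandas as pd

# Toy dataset from the table above; room and area are kept as strings
# so that the .str accessor can be used on them (an assumption).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "room": ["A-102", np.nan, "B309", "C·102", "E_1089", "27", "27"],
    "area": ["world", "24", np.nan, "25", "hello", np.nan, np.nan],
    "situation": [
        "under construction", "under construction", np.nan,
        "under decoration", "under decoration", "under plan", np.nan,
    ],
})
print(df)
```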
I need to check three columns, room, area and situation, based on the following conditions:
(1) if the room name contains anything other than digits, letters or - (NaNs also count as invalid), put incorrect room name in the check column;
(2) if area is neither a number nor NaN, append area is not numbers to the existing check column;
(3) if situation contains under decoration, append decoration is in the content to the existing check column.
Please note that I have other columns to check in the real data, and each new check result must be appended using ; as the separator.
How can I get the expected result, like this:
   id    room   area           situation                                                                   check
0   1   A-102  world  under construction                                                     area is not numbers
1   2     NaN     24  under construction                                                     incorrect room name
2   3    B309    NaN                 NaN                                                                     NaN
3   4   C·102     25    under decoration                       incorrect room name; decoration is in the content
4   5  E_1089  hello    under decoration  incorrect room name; area is not numbers; decoration is in the content
5   6      27    NaN          under plan                                                                     NaN
6   7      27    NaN                 NaN                                                                     NaN
My code so far:
Room name check:
df['check'] = np.where(df.room.str.match(r'^[a-zA-Z\d\-]*$'), np.nan, 'incorrect room name')
Out:
   id    room   area           situation                check
0   1   A-102  world  under construction                  nan
1   2     NaN     24  under construction                  nan
2   3    B309    NaN                 NaN                  nan
3   4   C·102     25    under decoration  incorrect room name
4   5  E_1089  hello    under decoration  incorrect room name
5   6      27    NaN          under plan                  nan
6   7      27    NaN                 NaN                  nan
Area check:
df['check'] = df['check'].where(df.area.str.contains(r'^\d+$', na=True),
                                'area is not a numbers')
Out:
   id    room   area           situation                  check
0   1   A-102  world  under construction  area is not a numbers
1   2     NaN     24  under construction                    nan
2   3    B309    NaN                 NaN                    nan
3   4   C·102     25    under decoration    incorrect room name
4   5  E_1089  hello    under decoration  area is not a numbers
5   6      27    NaN          under plan                    nan
6   7      27    NaN                 NaN                    nan
Situation check:
df['check'] = df['check'].where(df.situation.str.contains('under decoration', na = True),
'decoration is in the content')
Out:
   id    room   area           situation                         check
0   1   A-102  world  under construction  decoration is in the content
1   2     NaN     24  under construction  decoration is in the content
2   3    B309    NaN                 NaN                           nan
3   4   C·102     25    under decoration           incorrect room name
4   5  E_1089  hello    under decoration         area is not a numbers
5   6      27    NaN          under plan  decoration is in the content
6   7      27    NaN                 NaN                           nan
Thanks.
Answer 1:
First, change each numpy.where test so that rows passing the check yield None instead of a message; then zip the resulting arrays and apply a custom function that joins only the non-missing values:
import numpy as np
import pandas as pd

# each array holds a message where the check fails and None otherwise
a = np.where(df.room.str.match(r'^[a-zA-Z\d\-]*$', na=False), None,
             'incorrect room name')
b = np.where(df.area.str.contains(r'^\d+$', na=True), None,
             'area is not a numbers')
c = np.where(df.situation.str.contains('under decoration', na=False),
             'decoration is in the content', None)
# join the non-missing messages of one row, or return NaN if there are none
f = (lambda x: ';'.join(y for y in x if pd.notna(y))
     if any(pd.notna(np.array(x))) else np.nan)
df['check'] = [f(x) for x in zip(a, b, c)]
print(df)
   id    room   area           situation  \
0   1   A-102  world  under construction
1   2     NaN     24  under construction
2   3    B309    NaN                 NaN
3   4   C·102     25    under decoration
4   5  E_1089  hello    under decoration
5   6      27    NaN          under plan
6   7      27    NaN                 NaN

                                               check
0                              area is not a numbers
1                                incorrect room name
2                                                NaN
3   incorrect room name;decoration is in the content
4  incorrect room name;area is not a numbers;deco...
5                                                NaN
6                                                NaN
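Since the question mentions more columns to check in the real data, the same idea can also be sketched with the checks kept in a dict, so that adding a check means adding one entry. This is only an illustrative variant, not the code above: the names checks and join_messages are made up here, the toy data is rebuilt on the assumption that room and area are strings, and the messages and the "; " separator follow the expected output in the question.

```python
import numpy as np
import pandas as pd

# Toy data from the question (assumed to be string columns with NaN gaps).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "room": ["A-102", np.nan, "B309", "C·102", "E_1089", "27", "27"],
    "area": ["world", "24", np.nan, "25", "hello", np.nan, np.nan],
    "situation": ["under construction", "under construction", np.nan,
                  "under decoration", "under decoration", "under plan", np.nan],
})

# Hypothetical helper structure: message -> boolean mask, True where the row is flagged.
checks = {
    # NaN room names count as invalid, hence na=False before negating.
    "incorrect room name": ~df["room"].str.match(r"^[a-zA-Z\d\-]*$", na=False),
    # NaN areas are allowed, hence na=True before negating.
    "area is not numbers": ~df["area"].str.contains(r"^\d+$", na=True),
    "decoration is in the content": df["situation"].str.contains("under decoration", na=False),
}

def join_messages(row_flags):
    """Join the messages whose flag is True; return NaN when none applies."""
    msgs = [msg for msg, flagged in zip(checks, row_flags) if flagged]
    return "; ".join(msgs) if msgs else np.nan

flags = pd.DataFrame(checks)  # one boolean column per check, in dict order
df["check"] = [join_messages(row) for row in flags.to_numpy()]
print(df["check"].tolist())
```

On the toy data this should reproduce the check column from the question, with NaN for rows 2, 5 and 6.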
Answer 2:
I reworked your conditions a bit so the result comes closer to your expected output:
a = np.where(df.room.str.match(r'^[a-zA-Z\d\-]*$').notnull(), pd.NA, 'incorrect room name')
b = np.where(df["area"].str.isnumeric() & df["area"].notnull(), pd.NA, 'area is not a numbers')
c = np.where(df.situation.str.contains('under decoration', na=False), 'decoration is in the content', pd.NA)
s = (pd.concat([pd.Series(i, index=df.index) for i in (a, b, c)], axis=1)
       .stack().groupby(level=0).agg("; ".join))
print(df.assign(check=s))
   id    room   area           situation                                              check
0   1   A-102  world  under construction                              area is not a numbers
1   2     NaN     24  under construction                                incorrect room name
2   3    B309    NaN                 NaN  area is not a numbers; decoration is in the co...
3   4   C·102     25    under decoration                       decoration is in the content
4   5  E_1089  hello    under decoration  area is not a numbers; decoration is in the co...
5   6      27    NaN          under plan                              area is not a numbers
6   7      27    NaN                 NaN  area is not a numbers; decoration is in the co...
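The output above still differs from the expected result in the question. A possible variant of the same stack/groupby idea is sketched below, with the missing-value handling spelled out through the na argument of the string methods. It is only a sketch: it assumes room and area are string columns, uses the message wording from the expected output, and the intermediate names (room_ok, area_ok, has_deco, msgs) are made up for illustration. The explicit dropna() is there because whether stack() drops missing values by itself depends on the pandas version.

```python
import numpy as np
import pandas as pd

# Toy data from the question (assumed to be string columns with NaN gaps).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "room": ["A-102", np.nan, "B309", "C·102", "E_1089", "27", "27"],
    "area": ["world", "24", np.nan, "25", "hello", np.nan, np.nan],
    "situation": ["under construction", "under construction", np.nan,
                  "under decoration", "under decoration", "under plan", np.nan],
})

room_ok = df["room"].str.match(r"^[a-zA-Z\d\-]*$", na=False)   # NaN room -> invalid
area_ok = df["area"].str.contains(r"^\d+$", na=True)           # NaN area -> accepted
has_deco = df["situation"].str.contains("under decoration", na=False)

# One column of messages per check; None where nothing has to be reported.
msgs = pd.DataFrame({
    "room": np.where(room_ok, None, "incorrect room name"),
    "area": np.where(area_ok, None, "area is not numbers"),
    "situation": np.where(has_deco, "decoration is in the content", None),
}, index=df.index)

# Long format, drop the empty cells, then join what is left per original row.
s = msgs.stack().dropna().groupby(level=0).agg("; ".join)

# Rows without any message are absent from s and become NaN on assignment.
out = df.assign(check=s)
print(out["check"].tolist())
```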
Answer 3:
You can try this:
import os
import glob
import pandas as pd
os.chdir(r"C:\Users\Rameez PC\Desktop\python data files 2")
extension = 'xlsx'
all_filenames = glob.glob('*.{}'.format(extension))
# combine all files in the list
combined_xlsx1 = pd.concat([pd.read_excel(f) for f in all_filenames])
# export to a single Excel file
combined_xlsx1.to_excel("combined.xlsx", index=False)
Source: https://stackoverflow.com/questions/64746705/check-multiple-columns-data-format-and-append-results-to-one-column-in-pandas