Question
I have a DataFrame and would like to add another column that combines the columns whose names begin with Answer, for rows that share the same QID.
That is to say, having the following Dataframe
QID Category Text QType Question: Answer0 Answer1 Country
0 16 Automotive Access to car Single Do you have access to a car? I own a car/cars I own a car/cars UK
1 16 Automotive Access to car Single Do you have access to a car? I lease/ have a company car I lease/have a company car UK
2 16 Automotive Access to car Single Do you have access to a car? I have access to a car/cars I have access to a car/cars UK
3 16 Automotive Access to car Single Do you have access to a car? No, I don’t have access to a car/cars No, I don't have access to a car UK
4 16 Automotive Access to car Single Do you have access to a car? Prefer not to say Prefer not to say UK
I would like to get the following as a result:
QID Category Text QType Question: Answer0 Answer1 Answer2 Answer3 Country Answers
0 16 Automotive Access to car Single Do you have access to a car? I own a car/cars I lease/ have a company car I have access to a car/cars No, I don’t have access to a car/cars UK ['I own a car/cars', 'I lease/ have a company car' ,'I have access to a car/cars', 'No, I don’t have access to a car/cars', 'Prefer not to say Prefer not to say']
So far I have tried the following:
previous_qid = None
i = 0
j = 0
answers = []
new_row = {}
new_df = pd.DataFrame(columns=df.columns)
for _, row in df.iterrows():
    # get QID
    qid = row['QID']
    if qid == previous_qid:
        i += 1
        new_row['Answer' + str(i)] = row['Answer0']
        answers.append(row['Answer0'])
    elif new_row != {}:
        # we moved to a new row
        new_row['QID'] = qid
        new_row['Question'] = row['Question']
        new_row['Answers'] = answers
        # we create a new row in the new_dataframe
        new_df.append(new_row, ignore_index=True)
        # we clean up everything to receive the next row
        answers = []
        i = 0
        j += 1
        new_row = {}
        # we add the information of the current row
        new_row['Answer' + str(i)] = row['Answer0']
        answers.append(row['Answer0'])
    previous_qid = qid
But new_df ends up empty.
Answer 1:
Logically, this is a group-by on QID that collects the answers into a list, followed by splitting that list back out into columns.
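The grouping step on its own can be sketched on a small toy frame before applying it to the real data. The frame `toy` and its values are made up for illustration, and the output column name `Answers` is an assumption taken from the question's desired result.

```python
import pandas as pd

# Toy data, made up for illustration: one answer per row, QID repeats.
toy = pd.DataFrame({
    "QID": [16, 16, 17],
    "Question": ["Car access?", "Car access?", "Pets?"],
    "Answer0": ["I own a car", "I lease a car", "Yes"],
})

# One row per QID: keep the first Question, gather every Answer0 into a list.
grouped = (
    toy.groupby("QID")
       .agg(Question=("Question", "first"), Answers=("Answer0", list))
       .reset_index()
)
print(grouped)
```

Each cell of `Answers` is now a Python list, which is what the rest of this answer splits back into `Answer0`, `Answer1`, and so on.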
import re
import pandas as pd

# columns are separated by two or more spaces, so values may contain single spaces
data = """  QID  Category  Text  QType  Question:  Answer0  Answer1  Country
0  16  Automotive  Access to car  Single  Do you have access to a car?  I own a car/cars  I own a car/cars  UK
1  16  Automotive  Access to car  Single  Do you have access to a car?  I lease/ have a company car  I lease/have a company car  UK
2  16  Automotive  Access to car  Single  Do you have access to a car?  I have access to a car/cars  I have access to a car/cars  UK
3  16  Automotive  Access to car  Single  Do you have access to a car?  No, I don’t have access to a car/cars  No, I don't have access to a car  UK
4  16  Automotive  Access to car  Single  Do you have access to a car?  Prefer not to say  Prefer not to say  UK"""
# strip the leading row index from each line, then split on runs of two or more spaces
a = [[t.strip() for t in re.split(" {2,}", l) if t != ""] for l in [re.sub("([0-9]?[ ])*(.*)", r"\2", l) for l in data.split("\n")]]
df = pd.DataFrame(data=a[1:], columns=a[0])

# lazy: take the first value of every column except QID and the Answer columns
agg = {col: "first" for col in list(df.columns) if col != "QID" and "Answer" not in col}
# get a list of all answers in Answer0 for a QID
agg = {**agg, **{"Answer0": lambda s: list(s)}}

# helper function for the row-wise call; not needed, but makes the code more readable
def ans(r, i):
    return "" if i >= len(r["AnswerT"]) else r["AnswerT"][i]

# split the list from the aggregation back out into columns using assign
# rename Answer0 to AnswerT after the aggregation so that it can be referred to,
# then drop AnswerT once it is no longer needed
dfgrouped = df.groupby("QID").agg(agg).reset_index().rename(columns={"Answer0": "AnswerT"}).assign(
    Answer0=lambda dfa: dfa.apply(lambda r: ans(r, 0), axis=1),
    Answer1=lambda dfa: dfa.apply(lambda r: ans(r, 1), axis=1),
    Answer2=lambda dfa: dfa.apply(lambda r: ans(r, 2), axis=1),
    Answer3=lambda dfa: dfa.apply(lambda r: ans(r, 3), axis=1),
    Answer4=lambda dfa: dfa.apply(lambda r: ans(r, 4), axis=1),
    Answer5=lambda dfa: dfa.apply(lambda r: ans(r, 5), axis=1),
    Answer6=lambda dfa: dfa.apply(lambda r: ans(r, 6), axis=1),
).drop("AnswerT", axis=1)
print(dfgrouped.to_string(index=False))
output
QID Category Text QType Question: Country Answer0 Answer1 Answer2 Answer3 Answer4 Answer5 Answer6
16 Automotive Access to car Single Do you have access to a car? UK I own a car/cars I lease/ have a company car I have access to a car/cars No, I don’t have access to a car/cars Prefer not to say
More dynamic
This gets a bit more into advanced Python: it uses **kwargs and functools.partial. In reality it is still static, because the number of columns is defined by the constant MAXANS.
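As a minimal sketch of the two mechanisms on their own: `functools.partial` fixes one argument of a function ahead of time, and `**` unpacks a dict into keyword arguments for `assign`. The helper `pick`, the frame `toy`, and the column names `items`, `item0`, `item1` are hypothetical names used only for this illustration.

```python
import functools
import pandas as pd

def pick(dfa, row=0):
    # Element `row` of each list in the "items" column, or "" if the list is too short.
    return dfa["items"].apply(lambda xs: xs[row] if row < len(xs) else "")

# Toy data, made up for illustration.
toy = pd.DataFrame({"items": [["a", "b"], ["c"]]})

# partial(pick, row=i) is a callable taking only the DataFrame,
# which is the shape of callable that assign() accepts.
new_cols = {f"item{i}": functools.partial(pick, row=i) for i in range(2)}

# ** unpacks the dict: equivalent to toy.assign(item0=..., item1=...)
out = toy.assign(**new_cols)
print(out)
```

Using `partial` rather than a lambda inside the comprehension avoids Python's late binding of the loop variable: lambdas that refer to `i` directly would all see its final value.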
import functools

MAXANS = 8

def ansassign(dfa, row=0):
    return dfa.apply(lambda r: "" if row >= len(r["AnswerT"]) else r["AnswerT"][row], axis=1)

dfgrouped = df.groupby("QID").agg(agg).reset_index().rename(columns={"Answer0": "AnswerT"}).assign(
    **{f"Answer{i}": functools.partial(ansassign, row=i) for i in range(MAXANS)}
).drop("AnswerT", axis=1)
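If the fixed MAXANS is a concern, one possible variation (not part of the code above) is to let pandas size the columns from the longest list. This is a sketch on toy data made up for illustration. It assumes the grouped answers sit in a list column named `Answers`, as in the question's desired output.

```python
import pandas as pd

# Toy data, made up for illustration; the QIDs have different numbers of answers.
toy = pd.DataFrame({
    "QID": [16, 16, 16, 17],
    "Question": ["Car access?"] * 3 + ["Pets?"],
    "Answer0": ["I own a car", "I lease a car", "Prefer not to say", "Yes"],
})

# Collect the answers per QID into a list column.
grouped = (
    toy.groupby("QID")
       .agg(Question=("Question", "first"), Answers=("Answer0", list))
       .reset_index()
)

# One column per list position; shorter lists are padded with missing values,
# which are then replaced by empty strings.
wide = pd.DataFrame(grouped["Answers"].tolist(), index=grouped.index).fillna("")
wide.columns = [f"Answer{i}" for i in wide.columns]

result = grouped.join(wide)
print(result.to_string(index=False))
```

Here the number of `AnswerN` columns follows the data, and the `Answers` list column is kept alongside them. Drop it with `result.drop(columns="Answers")` if only the wide columns are wanted.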
Source: https://stackoverflow.com/questions/63244304/merge-lines-that-share-the-same-key-into-one-line