Question
I have a DataFrame and would like to add a column that combines the values of the columns whose names begin with Answer, across all rows that share the same QID.
That is, given the following DataFrame:
   QID  Category    Text           QType   Question:                     Answer0                                Answer1                           Country
0  16   Automotive  Access to car  Single  Do you have access to a car?  I own a car/cars                       I own a car/cars                  UK
1  16   Automotive  Access to car  Single  Do you have access to a car?  I lease/ have a company car            I lease/have a company car        UK
2  16   Automotive  Access to car  Single  Do you have access to a car?  I have access to a car/cars            I have access to a car/cars       UK
3  16   Automotive  Access to car  Single  Do you have access to a car?  No, I don’t have access to a car/cars  No, I don't have access to a car  UK
4  16   Automotive  Access to car  Single  Do you have access to a car?  Prefer not to say                      Prefer not to say                 UK
I would like to get the following as a result:
   QID  Category    Text           QType   Question:                     Answer0           Answer1                      Answer2                      Answer3                                Country  Answers
0  16   Automotive  Access to car  Single  Do you have access to a car?  I own a car/cars  I lease/ have a company car  I have access to a car/cars  No, I don’t have access to a car/cars  UK       ['I own a car/cars', 'I lease/ have a company car', 'I have access to a car/cars', 'No, I don’t have access to a car/cars', 'Prefer not to say Prefer not to say']
So far I have tried the following:
previous_qid = None
i = 0
j = 0
answers = []
new_row = {}
new_df = pd.DataFrame(columns=df.columns)
for _, row in df.iterrows():
    # get QID
    qid = row['QID']
    if qid == previous_qid:
        i+=1
        new_row['Answer'+str(i)]=row['Answer0']
        answers.append(row['Answer0'])
    elif new_row != {}:
        # we moved to a new row
        new_row['QID'] = qid
        new_row['Question'] = row['Question']
        new_row['Answers'] = answers
        # we create a new row in the new_dataframe
        new_df.append(new_row, ignore_index=True)
        # we clean up everything to receive the next row
        answers = []
        i=0
        j+=1
        new_row = {}
    # we add the information of the current row
    new_row['Answer'+str(i)]=row['Answer0']
    answers.append(row['Answer0'])
    previous_qid = qid
But new_df ends up empty.
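A likely reason the frame stays empty: DataFrame.append never modified a frame in place. It returned a new DataFrame, and the loop above discards that return value. (The method was deprecated in pandas 1.4 and removed in 2.0.) A minimal sketch of the usual workaround is to collect plain dicts in a list and build the frame once at the end. The three-row frame and the names rows and row are illustrative assumptions, not the data from the question:

```python
import pandas as pd

# Hypothetical miniature of the question's data: two questions, three answers.
df = pd.DataFrame({
    "QID": [16, 16, 17],
    "Question": ["Car access?", "Car access?", "Own a bike?"],
    "Answer0": ["I own a car", "I lease a car", "Yes"],
})

rows = []  # a plain Python list; list.append really does mutate it
for qid, group in df.groupby("QID", sort=False):
    answers = group["Answer0"].tolist()
    row = {"QID": qid, "Question": group["Question"].iloc[0], "Answers": answers}
    # One AnswerN key per collected answer.
    row.update({f"Answer{i}": a for i, a in enumerate(answers)})
    rows.append(row)

# Build the frame once; keys missing from a row become missing values.
new_df = pd.DataFrame(rows)
print(new_df)
```

Building the frame once from a list also avoids copying the whole frame on every iteration, which is what repeated appends did.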
Answer 1:
Logically, this groups by QID, collects the answers into a list, and then splits that list back out into columns.
import re
import pandas as pd

# assumption: columns in the sample text are separated by two spaces,
# because several values contain single spaces themselves
data = """ QID  Category  Text  QType  Question:  Answer0  Answer1  Country
0  16  Automotive  Access to car  Single  Do you have access to a car?  I own a car/cars  I own a car/cars  UK
1  16  Automotive  Access to car  Single  Do you have access to a car?  I lease/ have a company car  I lease/have a company car  UK
2  16  Automotive  Access to car  Single  Do you have access to a car?  I have access to a car/cars  I have access to a car/cars  UK
3  16  Automotive  Access to car  Single  Do you have access to a car?  No, I don’t have access to a car/cars  No, I don't have access to a car  UK
4  16  Automotive  Access to car  Single  Do you have access to a car?  Prefer not to say  Prefer not to say  UK"""
# strip the leading row index from each line, then split on two spaces
a = [[t.strip() for t in re.split("  ",l) if t!=""] for l in [re.sub("([0-9]?[ ])*(.*)", r"\2", l) for l in data.split("\n")]]
df = pd.DataFrame(data=a[1:], columns=a[0])
# lazy - take the first value of every attribute except QID and the Answer columns
agg = {col:"first" for col in list(df.columns) if col!="QID" and "Answer" not in col}
# get a list of all answers in Answer0 for a QID
agg = {**agg, **{"Answer0":lambda s: list(s)}}
# helper function for the row call. not needed, but makes it more readable
def ans(r, i):
    return "" if i>=len(r["AnswerT"]) else r["AnswerT"][i]
# split the list from the aggregation back out into columns using assign
# rename Answer0 to AnswerT after the aggregation so that it can be referred to,
# then drop AnswerT once it is no longer needed
dfgrouped = df.groupby("QID").agg(agg).reset_index().rename(columns={"Answer0":"AnswerT"}).assign(
    Answer0=lambda dfa: dfa.apply(lambda r: ans(r, 0), axis=1),
    Answer1=lambda dfa: dfa.apply(lambda r: ans(r, 1), axis=1),
    Answer2=lambda dfa: dfa.apply(lambda r: ans(r, 2), axis=1),
    Answer3=lambda dfa: dfa.apply(lambda r: ans(r, 3), axis=1),
    Answer4=lambda dfa: dfa.apply(lambda r: ans(r, 4), axis=1),
    Answer5=lambda dfa: dfa.apply(lambda r: ans(r, 5), axis=1),
    Answer6=lambda dfa: dfa.apply(lambda r: ans(r, 6), axis=1),
).drop("AnswerT", axis=1)
print(dfgrouped.to_string(index=False))
Output:
QID  Category    Text           QType   Question:                     Country  Answer0           Answer1                      Answer2                      Answer3                                Answer4            Answer5  Answer6
16   Automotive  Access to car  Single  Do you have access to a car?  UK       I own a car/cars  I lease/ have a company car  I have access to a car/cars  No, I don’t have access to a car/cars  Prefer not to say
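To make the aggregation step easier to follow, here is a minimal sketch of what a mixed agg dictionary produces before the lists are split back into columns. The three-row frame is a made-up stand-in for the survey data, and the built-in list is passed directly in place of the lambda:

```python
import pandas as pd

# Hypothetical stand-in for the survey data.
df = pd.DataFrame({
    "QID": ["16", "16", "17"],
    "Country": ["UK", "UK", "UK"],
    "Answer0": ["I own a car", "I lease a car", "Yes"],
})

# "first" keeps one value per group; list collects every answer in the group.
agg = {"Country": "first", "Answer0": list}
grouped = df.groupby("QID").agg(agg).reset_index()
print(grouped)
```

Each cell of the Answer0 column in grouped holds a Python list, which is why the answer then indexes into it position by position.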
More dynamic
This gets a bit further into advanced Python: it uses **kwargs unpacking and functools.partial. In reality it is still static, because the number of columns is fixed by the constant MAXANS.
import functools
MAXANS=8
def ansassign(dfa, row=0):
    return dfa.apply(lambda r: "" if row>=len(r["AnswerT"]) else r["AnswerT"][row], axis=1)
dfgrouped = df.groupby("QID").agg(agg).reset_index().rename(columns={"Answer0":"AnswerT"}).assign(
    **{f"Answer{i}":functools.partial(ansassign, row=i) for i in range(MAXANS)}
).drop("AnswerT", axis=1)
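If the fixed column count is a concern, one possible alternative (not part of the original answer) is to let pandas size the columns from the longest list. This is a sketch on a small made-up frame; lists, wide, others and result are illustrative names:

```python
import pandas as pd

# Hypothetical stand-in: QID 16 has three answers, QID 17 has one.
df = pd.DataFrame({
    "QID": ["16", "16", "16", "17"],
    "Country": ["UK", "UK", "UK", "UK"],
    "Answer0": ["I own a car", "I lease a car", "No car", "Yes"],
})

# One list of answers per QID.
lists = df.groupby("QID")["Answer0"].agg(list)

# Expanding the lists into a frame pads shorter rows with missing values;
# add_prefix turns the integer labels 0, 1, 2 into Answer0, Answer1, Answer2.
wide = pd.DataFrame(lists.tolist(), index=lists.index).add_prefix("Answer")

# Attach the other per-question attributes and the list itself.
others = df.groupby("QID")[["Country"]].first()
result = others.join(wide).assign(Answers=lists).reset_index()
print(result.to_string(index=False))
```

Unlike the assign-based version, missing answers appear here as missing values rather than empty strings; calling fillna("") on wide before the join would reproduce the original behaviour.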
Source: https://stackoverflow.com/questions/63244304/merge-lines-that-share-the-same-key-into-one-line