Merge lines that share the same key into one line

Question

I have a DataFrame and would like to merge the rows that share the same QID into a single row, spreading their answers across Answer columns and adding another column that holds all of those answers as a list.

That is, given the following DataFrame:

    QID     Category    Text    QType   Question:   Answer0     Answer1     Country
0   16  Automotive  Access to car   Single  Do you have access to a car?    I own a car/cars    I own a car/cars  UK
1   16  Automotive  Access to car   Single  Do you have access to a car?    I lease/ have a company car     I lease/have a company car  UK
2   16  Automotive  Access to car   Single  Do you have access to a car?    I have access to a car/cars     I have access to a car/cars     UK
3   16  Automotive  Access to car   Single  Do you have access to a car?    No, I don’t have access to a car/cars   No, I don't have access to a car    UK
4   16  Automotive  Access to car   Single  Do you have access to a car?    Prefer not to say   Prefer not to say   UK

I would like to get the following as a result:

        QID     Category    Text    QType   Question:   Answer0     Answer1     Answer2    Answer3  Country    Answers
    0   16  Automotive  Access to car   Single  Do you have access to a car?    I own a car/cars    I lease/ have a company car      I have access to a car/cars    No, I don’t have access to a car/cars    UK    ['I own a car/cars', 'I lease/ have a company car'   ,'I have access to a car/cars', 'No, I don’t have access to a car/cars', 'Prefer not to say     Prefer not to say']

So far I have tried the following:

previous_qid = None
i = 0
j = 0
answers = []
new_row = {}
new_df = pd.DataFrame(columns=df.columns)
for _, row in df.iterrows():
    # get QID
    qid = row['QID']
    if qid == previous_qid:
        i+=1
        new_row['Answer'+str(i)]=row['Answer0']
        answers.append(row['Answer0'])
    elif new_row != {}:
        # we moved to a new row
        new_row['QID'] = qid
        new_row['Question'] = row['Question']
        new_row['Answers'] = answers
        # we create a new row in the new_dataframe
        new_df.append(new_row, ignore_index=True)
        # we clean up everything to receive the next row
        answers = []
        i=0
        j+=1
        new_row = {}
        # we add the information of the current row
        new_row['Answer'+str(i)]=row['Answer0']
        answers.append(row['Answer0'])
    previous_qid = qid

But new_df ends up empty.

Answer 1:

Logically, this means grouping by QID, collecting the answers into a list, and then splitting that list back into columns.

import re
import pandas as pd
data = """    QID     Category    Text    QType   Question:   Answer0     Answer1     Country
0   16  Automotive  Access to car   Single  Do you have access to a car?    I own a car/cars    I own a car/cars  UK
1   16  Automotive  Access to car   Single  Do you have access to a car?    I lease/ have a company car     I lease/have a company car  UK
2   16  Automotive  Access to car   Single  Do you have access to a car?    I have access to a car/cars     I have access to a car/cars     UK
3   16  Automotive  Access to car   Single  Do you have access to a car?    No, I don’t have access to a car/cars   No, I don't have access to a car    UK
4   16  Automotive  Access to car   Single  Do you have access to a car?    Prefer not to say   Prefer not to say   UK"""
a = [[t.strip() for t in re.split("  ",l) if t!=""]  for l in [re.sub("([0-9]?[ ])*(.*)", r"\2", l) for l in data.split("\n")]]

df = pd.DataFrame(data=a[1:], columns=a[0])

# lazy approach: take the first value of every column except QID and the Answer columns
agg = {col:"first" for col in list(df.columns) if col!="QID" and "Answer" not in col}
# collect every Answer0 value for a QID into a list
agg = {**agg, **{"Answer0":lambda s: list(s)}}

# helper used in the row-wise apply calls; not strictly needed, but more readable
def ans(r, i):
    return "" if i>=len(r["AnswerT"]) else r["AnswerT"][i]

# split the list from the aggregation back out into columns using assign
# rename Answer0 to AnswerT after the aggregation so the list can be referred to,
# then drop AnswerT once it is no longer needed
dfgrouped = df.groupby("QID").agg(agg).reset_index().rename(columns={"Answer0":"AnswerT"}).assign(
    Answer0=lambda dfa: dfa.apply(lambda r: ans(r, 0), axis=1),
    Answer1=lambda dfa: dfa.apply(lambda r: ans(r, 1), axis=1),
    Answer2=lambda dfa: dfa.apply(lambda r: ans(r, 2), axis=1),
    Answer3=lambda dfa: dfa.apply(lambda r: ans(r, 3), axis=1),
    Answer4=lambda dfa: dfa.apply(lambda r: ans(r, 4), axis=1),
    Answer5=lambda dfa: dfa.apply(lambda r: ans(r, 5), axis=1),
    Answer6=lambda dfa: dfa.apply(lambda r: ans(r, 6), axis=1),
).drop("AnswerT", axis=1)

print(dfgrouped.to_string(index=False))

Output:

QID    Category           Text   QType                     Question: Country           Answer0                      Answer1                      Answer2                                Answer3            Answer4 Answer5 Answer6
 16  Automotive  Access to car  Single  Do you have access to a car?      UK  I own a car/cars  I lease/ have a company car  I have access to a car/cars  No, I don’t have access to a car/cars  Prefer not to say
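
The question also asks for an Answers column that holds the list itself, which the output above does not include. The following is a minimal sketch of that step, not part of the original answer. It assumes a small stand-in frame (the name demo, the second QID 17 and its question are hypothetical) with only three of the columns:

```python
import pandas as pd

# Hypothetical stand-in for the question's data, trimmed to three columns.
demo = pd.DataFrame({
    "QID": [16, 16, 16, 17, 17],
    "Question:": ["Do you have access to a car?"] * 3 + ["Do you cycle?"] * 2,
    "Answer0": [
        "I own a car/cars",
        "I lease/ have a company car",
        "Prefer not to say",
        "Yes",
        "No",
    ],
})

# One row per QID: first value of the descriptive column,
# and every Answer0 value collected into a Python list.
with_list = (
    demo.groupby("QID")
        .agg({"Question:": "first", "Answer0": list})
        .rename(columns={"Answer0": "Answers"})
        .reset_index()
)

print(with_list.to_string(index=False))
```

Keeping the list in its own column means the per-answer columns can be derived from it later without grouping a second time.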

More dynamic

This uses slightly more advanced Python: **kwargs unpacking and functools.partial. In practice it is still static, because the number of Answer columns is fixed by the constant MAXANS.

import functools 
MAXANS=8
def ansassign(dfa, row=0):
    return dfa.apply(lambda r: "" if row>=len(r["AnswerT"]) else r["AnswerT"][row], axis=1)
dfgrouped = df.groupby("QID").agg(agg).reset_index().rename(columns={"Answer0":"AnswerT"}).assign(
    **{f"Answer{i}":functools.partial(ansassign, row=i) for i in range(MAXANS)}
).drop("AnswerT", axis=1)
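
If the number of answer columns should follow the data rather than a constant, one possible alternative (a sketch, not the approach used above) is to build a frame directly from the per-QID lists and let pandas pad the shorter ones. The demo frame and the QID 17 rows are hypothetical stand-ins:

```python
import pandas as pd

# Hypothetical stand-in data: QID 16 has three answers, QID 17 has two.
demo = pd.DataFrame({
    "QID": [16, 16, 16, 17, 17],
    "Answer0": [
        "I own a car/cars",
        "I lease/ have a company car",
        "Prefer not to say",
        "Yes",
        "No",
    ],
})

# Series indexed by QID whose values are lists of answers.
lists = demo.groupby("QID")["Answer0"].agg(list)

# Expanding the lists gives one column per position; shorter lists are
# padded with missing values, which are replaced with "" here.
wide = (
    pd.DataFrame(lists.tolist(), index=lists.index)
      .add_prefix("Answer")
      .fillna("")
)

result = wide.reset_index()
print(result.to_string(index=False))
```

The column count then equals the length of the longest answer list, so no MAXANS constant is needed. The remaining descriptive columns could be joined back on QID.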

Source: https://stackoverflow.com/questions/63244304/merge-lines-that-share-the-same-key-into-one-line

Tags

python

python-3.x

pandas

dataframe