combine or iterate pandas rows on specific columns

Submitted by 懵懂的女人 on 2019-12-12 04:58:27

Question


I am struggling to figure out row-by-row iteration in pandas.

I have a dataset that contains chat conversations between two parties. I would like to collapse it so that each row is one turn in the conversation between Person 1 and Person 2. Sometimes a person types several sentences in a row, and these appear as multiple records in the dataframe.

This is the logic I have come up with:

  1. line_text should be combined
  2. timestamp should be updated to the latest time
  3. rows should be combined only when line_by shows that the same person typed multiple consecutive lines
  4. since the dataset contains multiple ids, each identifying one conversation between Person 1 and Person 2, the logic should run separately for each unique id.

    id    timestamp line_by line_text
    1234    02:54.3 Person1 Text Line 1
    1234    03:23.8 Person2 Text Line 2
    1234    03:47.0 Person2 Text Line 3
    1234    04:46.8 Person1 Text Line 4
    1234    05:46.2 Person1 Text Line 5
    9876    06:44.5 Person2 Text Line 6
    9876    07:27.6 Person1 Text Line 7
    9876    08:17.5 Person2 Text Line 8
    9876    10:20.3 Person2 Text Line 9
    

I would like the data to be transformed into the following:

    id    timestamp line_by line_text
    1234    02:54.3 Person1 Text Line 1
    1234    03:47.0 Person2 Text Line 2Text Line 3
    1234    05:46.2 Person1 Text Line 4Text Line 5
    9876    06:44.5 Person2 Text Line 6
    9876    07:27.6 Person1 Text Line 7
    9876    10:20.3 Person2 Text Line 8Text Line 9

Any ideas are appreciated.


Answer 1:


You could group by runs of consecutive line_by values, then use agg to take the latest timestamp and ''.join the line_text values.

In [1918]: (df.groupby((df.line_by != df.line_by.shift()).cumsum(), as_index=False)
              .agg({'id': 'first', 'timestamp': 'last', 'line_by': 'first',
                   'line_text': ''.join}))
Out[1918]:
  timestamp               line_text    id  line_by
0   02:54.3             Text Line 1  1234  Person1
1   03:47.0  Text Line 2Text Line 3  1234  Person2
2   05:46.2  Text Line 4Text Line 5  1234  Person1
3   06:44.5             Text Line 6  9876  Person2
4   07:27.6             Text Line 7  9876  Person1
5   10:20.3  Text Line 8Text Line 9  9876  Person2
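
For anyone who wants to run this outside an IPython session, here is a minimal self-contained sketch of the same idea. It rebuilds the sample frame from the question. The helper name `block` is my own, and I use `reset_index(drop=True)` in place of `as_index=False`, so treat it as one possible arrangement rather than the exact code above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1234] * 5 + [9876] * 4,
    "timestamp": ["02:54.3", "03:23.8", "03:47.0", "04:46.8", "05:46.2",
                  "06:44.5", "07:27.6", "08:17.5", "10:20.3"],
    "line_by": ["Person1", "Person2", "Person2", "Person1", "Person1",
                "Person2", "Person1", "Person2", "Person2"],
    "line_text": [f"Text Line {i}" for i in range(1, 10)],
})

# Label each run of consecutive rows typed by the same person.
# "block" is a hypothetical helper name, not part of the original data.
block = (df["line_by"] != df["line_by"].shift()).cumsum().rename("block")

result = (
    df.groupby(block)
      .agg({"id": "first",          # id of the first row in the run
            "timestamp": "last",    # latest timestamp in the run
            "line_by": "first",     # speaker of the run
            "line_text": "".join})  # concatenate the texts
      .reset_index(drop=True)       # discard the helper block labels
)

print(result)
```

The sample rows are already in chronological order, which is what makes 'last' pick the latest timestamp. If your data is not ordered, sort it first.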

Details

In [1919]: (df.line_by != df.line_by.shift()).cumsum()
Out[1919]:
0    1
1    2
2    2
3    3
4    3
5    4
6    5
7    6
8    6
Name: line_by, dtype: int32

In [1920]: df
Out[1920]:
     id timestamp  line_by    line_text
0  1234   02:54.3  Person1  Text Line 1
1  1234   03:23.8  Person2  Text Line 2
2  1234   03:47.0  Person2  Text Line 3
3  1234   04:46.8  Person1  Text Line 4
4  1234   05:46.2  Person1  Text Line 5
5  9876   06:44.5  Person2  Text Line 6
6  9876   07:27.6  Person1  Text Line 7
7  9876   08:17.5  Person2  Text Line 8
8  9876   10:20.3  Person2  Text Line 9
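
One caveat regarding point 4 of the question: the grouping key above looks only at line_by. If one conversation ends with the same person who starts the next one, those rows would be merged across ids. A possible way around this, sketched below on made-up data (the ids, timestamps and texts are hypothetical), is to also start a new block whenever id changes.

```python
import pandas as pd

# Hypothetical data: conversation 1 ends with Person2 and
# conversation 2 starts with Person2.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "timestamp": ["00:01.0", "00:02.0", "00:03.0", "00:04.0", "00:05.0", "00:06.0"],
    "line_by": ["Person1", "Person2", "Person2", "Person2", "Person2", "Person1"],
    "line_text": ["a", "b", "c", "d", "e", "f"],
})

# A new block starts when the speaker changes OR the conversation id changes.
changed = (df["line_by"] != df["line_by"].shift()) | (df["id"] != df["id"].shift())
block = changed.cumsum().rename("block")  # "block" is a hypothetical helper name

result = (
    df.groupby(block)
      .agg({"id": "first", "timestamp": "last",
            "line_by": "first", "line_text": "".join})
      .reset_index(drop=True)
)

print(result)
```

With the line_by check alone, rows b through e would collapse into a single row spanning both conversations. With the extra id check they stay separate as "bc" and "de".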


Source: https://stackoverflow.com/questions/46334930/combine-or-iterate-pandas-rows-on-specific-columns
