How to find the count of consecutive same string values in a pandas dataframe?

Posted by 别说谁变了你拦得住时间么 on 2021-02-19 07:20:26

Question


Assume that we have the following pandas dataframe:

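import pandas as pd

df = pd.DataFrame({
    'col1': ['A>G', 'C>T', 'C>T', 'G>T', 'C>T', 'A>G', 'A>G', 'A>G', 'C>A', 'C>T', 'C>T', 'C>T'],
    'col2': ['TCT', 'ACA', 'TCA', 'TCA', 'GCT', 'ACT', 'CTG', 'ATG', 'TCT', 'ACA', 'TCA', 'TCA'],
    'start': [1000, 2000, 3000, 4000, 5000, 6000, 10000, 20000, 10000, 2000, 3000, 4000]})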

input:
 col1 col2  start
0  A>G  TCT   1000
1  C>T  ACA   2000
2  C>T  TCA   3000
3  G>T  TCA   4000
4  C>T  GCT   5000
5  A>G  ACT   6000
6  A>G  CTG  10000
7  A>G  ATG  20000
8  C>A  TCT  10000
9  C>T  ACA   2000
10 C>T  TCA   3000
11 C>T  TCA   4000

For every run of two or more consecutive identical values in col1, I want the value itself, the length of the run, and the difference between the start of the run's last row and the start of its first row:

output:
  type  length   diff
0  C>T       2   1000
1  A>G       3  14000
2  C>T       3   2000

Answer 1:


With a little setup, you can do this with a single GroupBy.agg call (the lambda used for diff still runs once per group, so it is not fully vectorised):

aggfunc = {
    'col1': [('type', 'first'), ('length', 'count')], 
    'start': [('diff', lambda x: abs(x.iat[-1] - x.iat[0]))]
}

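# label each run: the label increases wherever col1 differs from the previous row
grouper = df.col1.ne(df.col1.shift()).cumsum()

v = df.assign(key=grouper).groupby('key').agg(aggfunc)
v.columns = v.columns.droplevel(0)
# keep only runs of two or more rows (filtering on diff != 0 would also drop
# longer runs whose first and last start happen to be equal)
v[v['length'].gt(1)].reset_index(drop=True)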

  type  length   diff
0  C>T       2   1000
1  A>G       3  14000
2  C>T       3   2000
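
The same run-labelling idea can also be written with named aggregation, which needs pandas 0.25 or later. This is only a sketch: the names run_id, first_start and last_start are made up for illustration, and the data is the 12-row frame from the question with col2 left out.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A>G', 'C>T', 'C>T', 'G>T', 'C>T', 'A>G',
             'A>G', 'A>G', 'C>A', 'C>T', 'C>T', 'C>T'],
    'start': [1000, 2000, 3000, 4000, 5000, 6000,
              10000, 20000, 10000, 2000, 3000, 4000],
})

# A new run starts wherever a value differs from the one before it;
# the cumulative sum of those change flags gives every run its own id.
run_id = df['col1'].ne(df['col1'].shift()).cumsum().rename('run_id')

runs = df.groupby(run_id).agg(
    type=('col1', 'first'),
    length=('col1', 'size'),
    first_start=('start', 'first'),
    last_start=('start', 'last'),
)
runs['diff'] = runs['last_start'] - runs['first_start']

# Keep runs of two or more rows and drop the helper columns.
result = (
    runs.loc[runs['length'] > 1, ['type', 'length', 'diff']]
        .reset_index(drop=True)
)
print(result)
```

Taking 'first' and 'last' and subtracting afterwards avoids a per-group lambda; whether that matters depends on how many runs the frame contains.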



Answer 2:


Something like the following should work:

import pandas as pd
from itertools import groupby

df = pd.DataFrame({
    'col1':['A>G','C>T','C>T','G>T','C>T', 'A>G','A>G','A>G','C>T','C>T','C>T'],
    'col2':['TCT','ACA','TCA','TCA','GCT', 'ACT','CTG','ATG','ACA','TCA','TCA'], 
    'start':[1000,2000,3000,4000,5000,6000,10000,20000,2000,3000,4000]})

final = []
pos = 0
# itertools.groupby yields one (value, run) pair per run of equal neighbours
for k, g in groupby(df['col1']):
    run_length = len(list(g))
    first_pos = pos
    last_pos = pos + run_length - 1
    if run_length > 1:  # keep only runs of two or more rows
        first = df['start'].iloc[first_pos]
        last = df['start'].iloc[last_pos]
        final.append({'type': k, 'length': run_length, 'diff': last - first})
    pos = last_pos + 1
final = pd.DataFrame(final)
print(final)

output:

diff    length  type
0   1000    2   C>T
1   14000   3   A>G
2   2000    3   C>T
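
If the position bookkeeping feels fragile, one possible variation is to feed itertools.groupby pairs of (col1, start) so that each run carries its own start values. This is a sketch on plain lists holding the same 11 values as above; the variable names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

col1 = ['A>G', 'C>T', 'C>T', 'G>T', 'C>T', 'A>G',
        'A>G', 'A>G', 'C>T', 'C>T', 'C>T']
start = [1000, 2000, 3000, 4000, 5000, 6000,
         10000, 20000, 2000, 3000, 4000]

runs = []
# Group neighbouring (value, start) pairs by the value only.
for value, pairs in groupby(zip(col1, start), key=itemgetter(0)):
    starts = [s for _, s in pairs]
    if len(starts) > 1:  # ignore values that occur only once in a row
        runs.append({'type': value,
                     'length': len(starts),
                     'diff': starts[-1] - starts[0]})

for run in runs:
    print(run)
```

The list of dicts can be passed to pd.DataFrame(runs) if a frame is wanted at the end.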



Answer 3:


Here is a two-step solution: first create an auxiliary column that labels each run of consecutive identical strings, then use a standard pandas groupby (the output below is for the 11-row frame used in the itertools answer above):

import numpy as np

# add a group variable
values = df['col1'].values
# get locations where value changes
change = np.zeros(values.size, dtype=bool)
change[1:] = values[:-1] != values[1:]
df['group'] = change.cumsum()  # summing change points yields the label

# do the aggregation
res = (df
 .groupby('group')
 .agg({'start': lambda x: x.max() - x.min(), 'col1': 'first', 'col2': 'size'})
 .rename(columns={'col1': 'type', 'col2': 'length', 'start': 'diff'})
)
# filter on more than one consecutive value
res = res[res['length'] > 1]

print(res)

        diff type  length
group                    
1       1000  C>T       2
4      14000  A>G       3
5       2000  C>T       3
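
To see why the labels in that output are 1, 4 and 5, it can help to look at the intermediate arrays. The sketch below repeats the labelling step on the 11 col1 values by themselves; labels and kept are names chosen for illustration.

```python
import numpy as np

values = np.array(['A>G', 'C>T', 'C>T', 'G>T', 'C>T', 'A>G',
                   'A>G', 'A>G', 'C>T', 'C>T', 'C>T'])

# True wherever an element differs from its predecessor (never for element 0).
change = np.zeros(values.size, dtype=bool)
change[1:] = values[:-1] != values[1:]

# Each True bumps the running total, so every run gets its own label.
labels = change.cumsum()

for value, flag, label in zip(values, change, labels):
    print(value, flag, label)

# Labels that occur more than once are the runs of length >= 2.
ids, counts = np.unique(labels, return_counts=True)
kept = ids[counts > 1]
print(kept)
```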



Answer 4:


You can use pandas groupby and more_itertools:

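import pandas as pd
import more_itertools as mit

def f(g):
    # g holds every row of one col1 value; with the default integer index,
    # consecutive index labels mean consecutive rows in the original frame
    rows = []
    tp = g['col1'].iloc[0]
    for group in mit.consecutive_groups(g.index):
        group = list(group)
        if len(group) == 1:
            continue  # skip isolated occurrences
        diff = g.loc[group[-1], 'start'] - g.loc[group[0], 'start']
        rows.append({'type': tp, 'length': len(group), 'diff': diff})
    # a list, not a set, keeps the column order fixed
    return pd.DataFrame(rows, columns=['type', 'length', 'diff'])

# rows come out grouped by type, not in order of first appearance
df.groupby('col1').apply(f).reset_index(drop=True)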


Source: https://stackoverflow.com/questions/53383208/how-to-find-the-count-of-consecutive-same-string-values-in-a-pandas-dataframe
