Using defined strings for regex searching with python

问题

I am looking to enhance the script I have below. I am wondering if it is possible to use defined strings such as 'G', 'SG', 'PF', 'PG', 'SF', 'F', 'UTIL', 'C' to search for the Names between them and then use those strings supplied as the name of the column. The issue I have with the current set up is if a name starts with two capitals like the example below it doesn't know the difference. Being able to set the current strings to search for with regex then return the text between them I think would be the next step to improve the function.

Previous Question: Python: Regex or Dictionary

import pandas as pd, numpy as np

dk_cont_lineup_df = pd.DataFrame(data=np.array([['G CJ McCollum SG Donovan Mitchell PF Robert Covington PG Collin Sexton SF Bojan Bogdanovic F Larry Nance Jr. UTIL Trey Lyles C Maxi Kleber'],['UTIL Nikola Vucevic PF Kevin Love F Robert Covington SG Collin Sexton SF Bojan Bogdanovic G Coby White PG RJ Barrett C Larry Nance Jr.']]))
dk_cont_lineup_df.rename(columns={ dk_cont_lineup_df.columns[0]: 'Lineup' }, inplace = True)


def calc_col(col):
    '''This function takes a string,
    finds the upper case letters or words placed as delimeter,
    converts it to a list,
    adds a number to the list elements if recurring.
    Eg. input list :['W','W','W','D','D','G','C','C','UTIL']
    o/p list: ['W1','W2','W3','D1','D2','G','C1','C2','UTIL']
    '''
    col_list = re.findall(" ?([A-Z]+) ", col)
    col_list2 = []
    for i_pos in col_list:
        cnt = col_list.count(i_pos)
        if cnt == 1:
            col_list2.append(i_pos)
        if cnt > 1:
            if i_pos in " ".join(col_list2):
                continue;
            col_list2 += [i_pos+str(k) for k in range(1,cnt+1)] 
    return col_list2


# START OF SPLIT LINEUP INTO SEPERATE COLUMNS
extr_row = dk_cont_lineup_df['Lineup'].replace(to_replace =" ?[A-Z]+ ", value="\n", regex = True) #split the rows on 
df_final = pd.DataFrame(columns = sorted(calc_col(dk_cont_lineup_df['Lineup'].iloc[0]))) #Create an empty data frame df3 with sorted columns
for i_pos in range(len(extr_row)): #traverse all the rows in the original dataframe and append the formatted rows to df3
    df_temp = pd.DataFrame((extr_row.values[i_pos].split("\n")[1:])).T
    df_temp.columns = calc_col(dk_cont_lineup_df['Lineup'].iloc[i_pos])
    df_temp= df_temp[sorted(df_temp)]
    df_final = df_final.append(df_temp)
df_final.reset_index(drop = True, inplace = True)

OUTPUT:

DESIRED OUTPUT:

I would like to use this script for other data that has other strings which would make it much easier to define what I am looking for. As we see from the input dataframe, the position of the Search strings are not in the same order. The script above will put them into order which we can see in the desired output dataframe.

回答1:

We can simply update your regex expression to check if the capitalised word is not directly next to the previous.

r"(?<![A-Z] )\b([A-Z]+) "

Note we have added a negative lookbehind. To not match if the previous word is not [A-Z]

You can find a more in-depth explanation on the above regex here; https://regex101.com/r/j6RbSP/1

You can now update your code to include the new regex patterns, ensure you remember to add r"" in front of the string.

import pandas as pd, numpy as np
import re

dk_cont_lineup_df = pd.DataFrame(data=np.array([['G CJ McCollum SG Donovan Mitchell PF Robert Covington PG Collin Sexton SF Bojan Bogdanovic F Larry Nance Jr. UTIL Trey Lyles C Maxi Kleber'],['UTIL Nikola Vucevic PF Kevin Love F Robert Covington SG Collin Sexton SF Bojan Bogdanovic G Coby White PG RJ Barrett C Larry Nance Jr.']]))
dk_cont_lineup_df.rename(columns={ dk_cont_lineup_df.columns[0]: 'Lineup' }, inplace = True)


def calc_col(col):
    '''This function takes a string,
    finds the upper case letters or words placed as delimeter,
    converts it to a list,
    adds a number to the list elements if recurring.
    Eg. input list :['W','W','W','D','D','G','C','C','UTIL']
    o/p list: ['W1','W2','W3','D1','D2','G','C1','C2','UTIL']
    '''
    col_list = re.findall(r"(?<![A-Z] )\b([A-Z]+) ", col)
    col_list2 = []
    for i_pos in col_list:
        cnt = col_list.count(i_pos)
        if cnt == 1:
            col_list2.append(i_pos)
        if cnt > 1:
            if i_pos in " ".join(col_list2):
                continue;
            col_list2 += [i_pos+str(k) for k in range(1,cnt+1)] 
    return col_list2


extr_row = dk_cont_lineup_df['Lineup'].replace(to_replace =r"(?<![A-Z] )\b([A-Z]+) ", value="\n", regex = True) #split the rows on 
df_final = pd.DataFrame(columns = sorted(calc_col(dk_cont_lineup_df['Lineup'].iloc[0])))

for i_pos in range(len(extr_row)): #traverse all the rows in the original dataframe and append the formatted rows to df3
    df_temp = pd.DataFrame((extr_row.values[i_pos].split("\n")[1:])).T
    df_temp.columns = calc_col(dk_cont_lineup_df['Lineup'].iloc[i_pos])
    df_temp= df_temp[sorted(df_temp)]
    df_final = df_final.append(df_temp)
df_final.reset_index(drop = True, inplace = True)

print(df_final.to_string())

Produces the desired output:

                 C                  F             G                 PF              PG                 SF                 SG             UTIL
0      Maxi Kleber   Larry Nance Jr.   CJ McCollum   Robert Covington   Collin Sexton   Bojan Bogdanovic   Donovan Mitchell       Trey Lyles 
1  Larry Nance Jr.  Robert Covington    Coby White         Kevin Love      RJ Barrett   Bojan Bogdanovic      Collin Sexton   Nikola Vucevic

来源：https://stackoverflow.com/questions/60518454/using-defined-strings-for-regex-searching-with-python

标签

python

regex