Find length of longest string in Pandas dataframe column

前端 未结 5 658
忘掉有多难
忘掉有多难 2020-12-23 13:18

Is there a faster way to find the length of the longest string in a Pandas DataFrame than what\'s shown in the example below?

import numpy as np
import panda         


        
相关标签:
5条回答
  • 2020-12-23 13:52

    Just as a minor addition, you might want to loop through all object columns in a data frame:

    for c in df:
        if df[c].dtype == 'object':
            print('Max length of column %s: %s\n' %  (c, df[c].map(len).max()))
    

    This will prevent errors being thrown by bool, int types etc.

    Could be expanded for other non-numeric types such as 'string_', 'unicode_' i.e.

    if df[c].dtype in ('object', 'string_', 'unicode_'):
    
    0 讨论(0)
  • 2020-12-23 13:57

    Excellent answers, in particular Marius and Ricky which were very helpful.

    Given that most of us are optimising for coding time, here is a quick extension to those answers to return all the columns' max item length as a series, sorted by the maximum item length per column:

    mx_dct = {c: df[c].map(lambda x: len(str(x))).max() for c in df.columns}
    pd.Series(mx_dct).sort_values(ascending =False)
    

    Or as a one liner:

    pd.Series({c: df[c].map(lambda x: len(str(x))).max() for c in df).sort_values(ascending =False)
    

    Adapting the original sample, this can be demoed as:

    import pandas as pd
    
    x = [['ab', 'bcd'], ['dfe', 'efghik']]
    df = pd.DataFrame(x, columns=['col1','col2'])
    
    print(pd.Series({c: df[c].map(lambda x: len(str(x))).max() for c in df}).sort_values(ascending =False))
    

    Output:

    col2    6
    col1    3
    dtype: int64
    
    0 讨论(0)
  • 2020-12-23 13:59

    Sometimes you want the length of the longest string in bytes. This is relevant for strings that use fancy Unicode characters, in which case the length in bytes is greater than the regular length. This can be very relevant in specific situations, e.g. for database writes.

    df_col_len = int(df[df_col_name].str.encode(encoding='utf-8').str.len().max())
    

    The above line has the extra str.encode(encoding='utf-8'). The output is enclosed in int() because it is otherwise a numpy object.

    0 讨论(0)
  • 2020-12-23 14:00

    You should try using numpy. This could also help you get efficiency improvements.

    The below code will give you maximum lengths for each column in an excel spreadsheet (read into a dataframe using pandas)

    import pandas as pd
    import numpy as np
    
    xl = pd.ExcelFile('sample.xlsx')
    df = xl.parse('Sheet1')
    
    columnLenghts = np.vectorize(len)
    maxColumnLenghts = columnLenghts(df.values.astype(str)).max(axis=0)
    print('Max Column Lengths ', maxColumnLenghts)
    
    0 讨论(0)
  • 2020-12-23 14:06

    DSM's suggestion seems to be about the best you're going to get without doing some manual microoptimization:

    %timeit -n 100 df.col1.str.len().max()
    100 loops, best of 3: 11.7 ms per loop
    
    %timeit -n 100 df.col1.map(lambda x: len(x)).max()
    100 loops, best of 3: 16.4 ms per loop
    
    %timeit -n 100 df.col1.map(len).max()
    100 loops, best of 3: 10.1 ms per loop
    

    Note that explicitly using the str.len() method doesn't seem to be much of an improvement. If you're not familiar with IPython, which is where that very convenient %timeit syntax comes from, I'd definitely suggest giving it a shot for quick testing of things like this.

    Update Added screenshot:

    0 讨论(0)
提交回复
热议问题