Convert dataframe to a rec array (and objects to strings)

失恋的感觉 2020-12-22 01:30

I have a pandas dataframe with a mix of datatypes (dtypes) that I wish to convert to a numpy structured array (or record array, basically the same thing in this case). For

2 answers
  • 2020-12-22 02:10

    As far as I am aware, there is no native functionality for this. For example, the maximum length of all values within a series is not stored anywhere.

    However, you can implement your logic more efficiently via a list comprehension and f-strings:

    # Assumes arr is the record array built from df, e.g. arr = df.to_records(index=False)
    data_types = [(col, arr[col].dtype if arr[col].dtype != 'O' else
                   f'U{df[col].astype(str).str.len().max()}') for col in arr.dtype.names]
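
To see the list comprehension above end to end, here is a minimal sketch. The sample dataframe is hypothetical (the question's own data is not shown here), and it assumes `arr` comes from `df.to_records(index=False)` and that the resulting dtype list is applied with `astype`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the original question
df = pd.DataFrame({
    "ints": [1, 2, 3],
    "flts": [1.1, 2.2, 3.3],
    "strs": ["x", "yy", "zzz"],
})

# Assumption: arr is the record array pandas builds; string columns become object fields
arr = df.to_records(index=False)

# Keep numeric dtypes as they are; size each object field to its longest string
data_types = [
    (col, arr[col].dtype if arr[col].dtype != "O"
     else f"U{df[col].astype(str).str.len().max()}")
    for col in arr.dtype.names
]

# Cast field by field to the fixed-width dtypes
converted = arr.astype(data_types)
print(converted.dtype)
```

The maximum length has to be computed by scanning each object column, which is why this step costs a full pass over the string data.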
    
  • 2020-12-22 02:19

    Combining suggestions from @jpp (list comp for conciseness) & @hpaulj (cannibalize to_records for speed), I came up with the following, which is cleaner code and also about 5x faster than my original code (tested by expanding the sample dataframe above to 10,000 rows):

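    import numpy as np

    names = list(df.columns)
    # Series.get_values() was removed in pandas 1.0; to_numpy() replaces it
    arrays = [df[col].to_numpy() for col in names]

    # Object columns get the fixed-width unicode dtype NumPy infers, e.g. '<U4'
    formats = [array.dtype if array.dtype != 'O'
               else f'{array.astype(str).dtype}' for array in arrays]

    rec_array = np.rec.fromarrays(arrays, dtype={'names': names, 'formats': formats})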
    

    The above outputs unicode rather than byte strings, which is probably better in general. In my case, though, I need byte strings because I'm reading the binary file in Fortran, and byte strings seem to read in more easily. Hence, it may be better to replace the "formats" line above with this:

    formats = [ array.dtype if array.dtype != 'O' 
                else array.astype(str).dtype.str.replace('<U','S') for array in arrays ]
    

    E.g. a dtype of <U4 becomes S4.
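
As a rough check of the byte-string variant, the sketch below builds the record array with `S` formats and inspects the result. The sample dataframe and the helper name `bytes_format` are hypothetical. Instead of replacing `'<U'` in the dtype string, it derives the width from `itemsize` (NumPy stores 4 bytes per unicode character), which is an alternative that does not depend on the byte-order prefix. It also assumes the strings are ASCII, since NumPy encodes them as ASCII when filling an `S` field.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the original question
df = pd.DataFrame({
    "ints": [1, 2, 3],
    "flts": [1.1, 2.2, 3.3],
    "strs": ["x", "yy", "zzz"],
})

names = list(df.columns)
arrays = [df[col].to_numpy() for col in names]


def bytes_format(array):
    """Hypothetical helper: keep non-object dtypes, map object columns to 'S<n>'."""
    if array.dtype != "O":
        return array.dtype
    unicode_dtype = array.astype(str).dtype   # e.g. <U3
    n_chars = unicode_dtype.itemsize // 4     # 4 bytes per unicode character
    return np.dtype(f"S{n_chars}")


formats = [bytes_format(array) for array in arrays]
rec_array = np.rec.fromarrays(arrays, dtype={"names": names, "formats": formats})

print(rec_array.dtype)
print(rec_array.itemsize)  # bytes per record in the binary output
```

Each record then occupies a fixed number of bytes, so `rec_array.tofile(path)` writes rows of constant width that a fixed-layout reader can step through.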
