I have a pandas DataFrame with a mix of datatypes (dtypes) that I wish to convert to a numpy structured array (or record array, which is essentially the same thing in this case).
As far as I am aware, there is no native functionality for this. For example, the maximum length of all values within a series is not stored anywhere.
However, you can implement your logic more efficiently via a list comprehension and f-strings:
data_types = [(col, df[col].dtype if df[col].dtype != 'O' else
               f'U{df[col].astype(str).str.len().max()}') for col in df.columns]
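To make the idea concrete, here is a minimal self-contained sketch using a made-up sample frame (the question's actual DataFrame isn't reproduced here): object columns get a fixed-width unicode type sized from their longest value, and the resulting dtype list is used to build the structured array.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame standing in for the question's DataFrame
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [1.5, 2.5, 3.5],
                   'C': ['ab', 'cdef', 'g']})

# For object columns, derive a 'U<n>' type from the longest string value
data_types = [(col, df[col].dtype if df[col].dtype != 'O' else
               f'U{df[col].astype(str).str.len().max()}')
              for col in df.columns]

# Build the structured array from row tuples
structured = np.array(list(df.itertuples(index=False)), dtype=data_types)
```

Here `structured['C']` ends up with dtype `<U4`, since 'cdef' is the longest value in that column.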
Combining suggestions from @jpp (list comp for conciseness) & @hpaulj (cannibalize to_records
for speed), I came up with the following, which is cleaner code and also about 5x faster than my original code (tested by expanding the sample dataframe above to 10,000 rows):
names = list(df.columns)
arrays = [df[col].to_numpy() for col in names]  # get_values() was removed in pandas 1.0
formats = [arr.dtype if arr.dtype != 'O'
           else arr.astype(str).dtype for arr in arrays]
rec_array = np.rec.fromarrays(arrays, dtype={'names': names, 'formats': formats})
The above outputs unicode rather than byte strings, which is probably better in general. In my case, though, I need byte strings because I'm reading the binary file in Fortran, and fixed-width byte strings seem to read in more easily. Hence, it may be better to replace the "formats" line above with this:
formats = [arr.dtype if arr.dtype != 'O'
           else arr.astype(str).dtype.str.replace('<U', 'S') for arr in arrays]
E.g. a dtype of <U4 becomes S4.
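Putting the whole recipe together, here is a runnable sketch on a hypothetical frame (the question's DataFrame isn't shown here), including the swap from little-endian unicode ('<U...') to fixed-width bytes ('S...'):

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame standing in for the question's DataFrame
df = pd.DataFrame({'x': [1, 2], 'y': [0.5, 1.5], 'name': ['foo', 'barbaz']})

names = list(df.columns)
arrays = [df[col].to_numpy() for col in names]

# Fixed-width byte strings ('S...') instead of unicode ('<U...'),
# sized from each object column's longest value
formats = [arr.dtype if arr.dtype != 'O'
           else arr.astype(str).dtype.str.replace('<U', 'S')
           for arr in arrays]

rec_array = np.rec.fromarrays(arrays, dtype={'names': names, 'formats': formats})
```

Note that with an 'S' dtype the values come back as Python bytes (e.g. b'barbaz'), which is exactly the fixed-width layout a Fortran reader expects; write it out with rec_array.tofile(...).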