How to keep column names when converting from pandas to numpy

挽巷 2021-02-20 08:03

According to this post, I should be able to access the names of columns in an ndarray as a.dtype.names

However, if I convert a pandas DataFrame to an ndarray with df.as_matrix(), the column names are lost and a.dtype.names is None. How can I keep them?

4 Answers
  • 2021-02-20 08:16

Pandas DataFrames also have a handy to_records method. Demo:

    import pandas as pd

    X = pd.DataFrame(dict(age=[40., 50., 60.],
                          sys_blood_pressure=[140., 150., 160.]))
    m = X.to_records(index=False)
    print(repr(m))
    

    Returns:

    rec.array([(40.0, 140.0), (50.0, 150.0), (60.0, 160.0)], 
              dtype=[('age', '<f8'), ('sys_blood_pressure', '<f8')])
    

    This is a "record array", which is an ndarray subclass that allows field access using attributes, e.g. m.age in addition to m['age'].
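
    For instance, a quick check against the m built above:

    print(m.dtype.names)  # ('age', 'sys_blood_pressure')
    print(m.age)          # attribute-style field access
    print(m['age'])       # equivalent key-style access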

    You can pass this to a Cython function as a regular float array by constructing a view:

    m_float = m.view(float).reshape(m.shape + (-1,))
    print(repr(m_float))
    

    Which gives:

    rec.array([[  40.,  140.],
               [  50.,  150.],
               [  60.,  160.]], 
              dtype=float64)
    

    Note that in order for this to work, the original DataFrame must have a float dtype for every column. To make sure, use m = X.astype(float, copy=False).to_records(index=False).
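
    For example, a minimal sketch of that cast-first pipeline (the integer age column here is my own illustration):

    X = pd.DataFrame(dict(age=[40, 50, 60],                  # int64 column
                          sys_blood_pressure=[140., 150., 160.]))
    m = X.astype(float, copy=False).to_records(index=False)
    m_float = m.view(float).reshape(m.shape + (-1,))         # safe: every field is now float64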

  • 2021-02-20 08:23

    Consider a DF as shown below:

    import numpy as np
    import pandas as pd

    X = pd.DataFrame(dict(one=['Strawberry', 'Fields', 'Forever'], two=[1, 2, 3]))
    X
    

    Provide a list of tuples as data input to the structured array:

    arr_ip = [tuple(i) for i in X.to_numpy()]  # as_matrix() was removed in pandas 1.0
    

    Ordered list of field names:

    dtyp = np.dtype(list(zip(X.dtypes.index, X.dtypes)))
    

    Here, X.dtypes.index gives you the column names and X.dtypes their corresponding dtypes; these are zipped into a list of tuples and fed to np.dtype to construct the structured dtype.
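
    A quick sanity check of the constructed dtype (output consistent with the array dump below):

    print(repr(dtyp))   # dtype([('one', 'O'), ('two', '<i8')])
    print(dtyp.names)   # ('one', 'two')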

    arr = np.array(arr_ip, dtype=dtyp)
    

    gives:

    arr
    # array([('Strawberry', 1), ('Fields', 2), ('Forever', 3)], 
    #       dtype=[('one', 'O'), ('two', '<i8')])
    

    and

    arr.dtype.names
    # ('one', 'two')
    
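    As a side note (my own addition), the trip back is just as painless, because pandas reads the field names:

    pd.DataFrame(arr)   # rebuilds columns 'one' and 'two' from arr.dtype.names
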
  • 2021-02-20 08:25

    OK, here's where I'm leaning:

    import numpy as np
    import pandas as pd

    class NDArrayWithColumns(np.ndarray):
        def __new__(cls, obj, columns=None):
            obj = obj.view(cls)
            obj.columns = columns
            return obj

        def __array_finalize__(self, obj):
            if obj is None: return
            self.columns = getattr(obj, 'columns', None)

        @staticmethod
        def from_dataframe(df):
            cols = tuple(df.columns)
            arr = df[list(cols)].to_numpy()  # as_matrix() was removed in pandas 1.0
            return NDArrayWithColumns.from_array(arr, cols)

        @staticmethod
        def from_array(array, columns):
            if isinstance(array, NDArrayWithColumns):
                return array
            return NDArrayWithColumns(array, tuple(columns))

        def __str__(self):
            sup = np.ndarray.__str__(self)
            if self.columns:
                header = ", ".join(self.columns)
                header = "# " + header + "\n"
                return header + sup
            return sup

    NAN = float("nan")
    X = pd.DataFrame(dict(age=[40., NAN, 60.], sys_blood_pressure=[140., 150., 160.]))
    arr = NDArrayWithColumns.from_dataframe(X)
    print(arr)
    print(arr.columns)
    print(arr.dtype)
    

    Gives:

    # age, sys_blood_pressure
    [[  40.  140.]
     [  nan  150.]
     [  60.  160.]]
    ('age', 'sys_blood_pressure')
    float64
    

    and can also be passed to a typed Cython function expecting an ndarray[double_t, ndim=2].

    UPDATE: this works pretty well, except for some oddness when passing the type to ufuncs.
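
    That oddness stems from __array_finalize__ copying columns wholesale onto any derived array; a small illustration (my own addition, building on arr from above):

    sub = arr[:2]
    print(sub.columns)  # ('age', 'sys_blood_pressure') -- row slices keep the right labels
    col = arr[:, 0]     # a single-column view is still tagged with BOTH names
    print(col.columns)  # ('age', 'sys_blood_pressure') -- stale for this view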

  • Yet more methods of converting a pandas.DataFrame to numpy.array while preserving label/column names

    This is mainly for demonstrating how to set dtype/column_dtypes, because sometimes a data source iterator's output will need some pre-normalization.


    Method one inserts values column by column into a zeroed array of predefined height, and is loosely based on a Creating Structured Arrays guide that a bit of web searching turned up:

    import numpy
    
    
    def to_tensor(dataframe, columns = [], dtypes = {}):
        # Use all columns from data frame if none were listed when called
        if len(columns) <= 0:
            columns = list(dataframe.columns)  # plain list keeps numpy's dtype constructor happy
        # Build list of dtypes to use, updating from any `dtypes` passed when called
        dtype_list = []
        for column in columns:
            if column not in dtypes.keys():
                dtype_list.append(dataframe[column].dtype)
            else:
                dtype_list.append(dtypes[column])
        # Build dictionary with lists of column names and formatting in the same order
        dtype_dict = {
            'names': columns,
            'formats': dtype_list
        }
        # Initialize _mostly_ empty numpy array with column names and formatting
        numpy_buffer = numpy.zeros(
            shape = len(dataframe),
            dtype = dtype_dict)
        # Insert values from dataframe columns into numpy labels
        for column in columns:
            numpy_buffer[column] = dataframe[column].to_numpy()
        # Return results of conversion
        return numpy_buffer
    

    Method two is based on user7138814's answer and will likely be more efficient, as it is basically a wrapper for the built-in to_records method available to pandas.DataFrames:

    def to_tensor(dataframe, columns = [], dtypes = {}, index = False):
        to_records_kwargs = {'index': index}
        if not columns:  # Default to all `dataframe.columns`
            columns = dataframe.columns
        if dtypes:       # Pull in modifications only for dtypes listed in `columns`
            to_records_kwargs['column_dtypes'] = {}
            for column in dtypes.keys():
                if column in columns:
                    to_records_kwargs['column_dtypes'].update({column: dtypes.get(column)})
        return dataframe[columns].to_records(**to_records_kwargs)
    

    With either of the above one could do...

    import pandas
    
    X = pandas.DataFrame(dict(age = [40., 50., 60.], sys_blood_pressure = [140., 150., 160.]))
    
    # Example of overwriting dtype for a column
    X_tensor = to_tensor(X, dtypes = {'age': 'int32'})
    
    print("Ages -> {0}".format(X_tensor['age']))
    print("SBPs -> {0}".format(X_tensor['sys_blood_pressure']))
    

    ... which should output...

    Ages -> [40 50 60]
    SBPs -> [140. 150. 160.]
    

    ... and a full dump of X_tensor should look like the following (method one returns a plain structured array as shown; method two returns a rec.array with the same fields).

    array([(40, 140.), (50, 150.), (60, 160.)],
          dtype=[('age', '<i4'), ('sys_blood_pressure', '<f8')])
    
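    And, circling back to the original question, the labels survive the conversion either way:

    print(X_tensor.dtype.names)  # ('age', 'sys_blood_pressure')
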

    Some thoughts

    While method two will likely be more efficient than the first, method one (with some modifications) may be more useful for merging two or more pandas.DataFrames into one numpy.array; a rough sketch of that idea follows.
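
    A rough sketch of that merging idea (my own addition, not from the original answer; it assumes equal-length frames and column names that are unique across them):

    import numpy
    import pandas

    def merge_to_tensor(*dataframes):
        # Hypothetical helper: pack columns from several equal-length
        # DataFrames into one structured array, keeping every column name
        names, formats = [], []
        length = len(dataframes[0])
        for df in dataframes:
            assert len(df) == length, 'all frames must share the same length'
            names.extend(df.columns)   # field names must be unique across frames
            formats.extend(df.dtypes)
        buffer = numpy.zeros(shape = length, dtype = {'names': names, 'formats': formats})
        for df in dataframes:
            for column in df.columns:
                buffer[column] = df[column].to_numpy()
        return buffer

    left = pandas.DataFrame(dict(age = [40., 50., 60.]))
    right = pandas.DataFrame(dict(sys_blood_pressure = [140., 150., 160.]))
    print(merge_to_tensor(left, right).dtype.names)  # ('age', 'sys_blood_pressure')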

    Additionally (after swinging back through to review), method two as written will likely face-plant with errors about to_records_kwargs not being a mapping if dtypes is not defined; next time I'm feeling Pythonic I may resolve that with an else condition.
