Splitting a large pandas DataFrame with minimal memory footprint

我寻月下人不归 2021-02-04 17:42

I have a large DataFrame, which I would like to split into a test set and a train set for model building. However, I do not want to duplicate the DataFrame because I am reaching

3 answers
  •  春和景丽
    2021-02-04 18:21

    The other answers focus on reading from a file, so I guess you can also do something if, for any reason, your DataFrame isn't read from a file.

    Maybe you can take a look at the code of the DataFrame.drop method and adapt it so that it modifies your DataFrame in place (which drop already does with inplace=True) and also returns the removed rows:

    import pandas as pd


    class DF(pd.DataFrame):
        # Variant of DataFrame.drop that also hands back the removed part.
        # It relies on private pandas helpers (_get_axis*, _update_inplace),
        # so it may break between pandas versions.
        def drop(self, labels, axis=0, level=None, inplace=False, errors='raise'):
            axis = self._get_axis_number(axis)
            axis_name = self._get_axis_name(axis)
            axis, axis_ = self._get_axis(axis), axis

            if axis.is_unique:
                if level is not None:
                    if not isinstance(axis, pd.MultiIndex):
                        raise AssertionError('axis must be a MultiIndex')
                    new_axis = axis.drop(labels, level=level, errors=errors)
                else:
                    new_axis = axis.drop(labels, errors=errors)
                dropped = self.reindex(**{axis_name: new_axis})
                try:
                    dropped.axes[axis_].set_names(axis.names, inplace=True)
                except AttributeError:
                    pass
                result = dropped

            else:
                # Accept a single label as well as a list-like of labels.
                if not pd.api.types.is_list_like(labels):
                    labels = [labels]
                if level is not None:
                    if not isinstance(axis, pd.MultiIndex):
                        raise AssertionError('axis must be a MultiIndex')
                    indexer = ~axis.get_level_values(level).isin(labels)
                else:
                    indexer = ~axis.isin(labels)

                slicer = [slice(None)] * self.ndim
                slicer[self._get_axis_number(axis_name)] = indexer

                # .ix was removed in pandas 1.0; .loc accepts the boolean mask.
                result = self.loc[tuple(slicer)]

            # Select the removed part along the same axis that was dropped.
            removed = self.loc[labels] if axis_ == 0 else self.loc[:, labels]

            if inplace:
                self._update_inplace(result)
                return removed
            else:
                return result, removed
    

    It works like this:

    df = DF({'one':[1,2,3,4,5,4,3,2,1], 'two':[6,7,8,9,10,9,8,7,6], 'three':[11,12,13,14,15,14,13,12,11]})
    
    dropped = df.drop(range(5), inplace=True)
    # or :
    # partA, partB = df.drop(range(5))
    

    This example probably isn't very memory efficient, but maybe you can figure out something better by using some kind of object-oriented solution like this.
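
For comparison, here is a minimal sketch of the same "remove the test rows in place and return them" idea using only public pandas calls. The function name `split_off_test` and its parameters `test_frac` and `seed` are hypothetical, and the sketch assumes the index labels are unique. As far as I know, `drop(..., inplace=True)` still builds the reduced frame internally before swapping it in, so peak memory during the call is not lower; the gain is only that no second full-size frame stays alive afterwards.

```python
import numpy as np
import pandas as pd


def split_off_test(df, test_frac=0.2, seed=0):
    """Remove a random test set from df in place and return it.

    Hypothetical helper; assumes df has unique index labels.
    After the call, df holds only the training rows.
    """
    rng = np.random.default_rng(seed)
    n_test = int(round(len(df) * test_frac))
    # Pick row positions at random, without replacement.
    test_pos = rng.choice(len(df), size=n_test, replace=False)
    # Copy out only the test rows (a small fraction of the data).
    test = df.iloc[test_pos]
    # Remove those rows from df itself; df becomes the training set.
    df.drop(df.index[test_pos], inplace=True)
    return test


df = pd.DataFrame({'one': [1, 2, 3, 4, 5, 4, 3, 2, 1],
                   'two': [6, 7, 8, 9, 10, 9, 8, 7, 6],
                   'three': [11, 12, 13, 14, 15, 14, 13, 12, 11]})

test = split_off_test(df, test_frac=0.25, seed=42)
train = df  # same object, no extra copy
```

Here `train` is just another name for the shrunken `df`, so only the test rows are held in a separate object.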
