Splitting a large Pandas DataFrame with minimal memory footprint

我寻月下人不归 2021-02-04 17:42

I have a large DataFrame, which I would like to split into a test set and a train set for model building. However, I do not want to duplicate the DataFrame because I am reaching

3 Answers

春和景丽 (OP)

2021-02-04 18:21

The other answers focus on reading the file, so here is something you can try if, for whatever reason, your DataFrame isn't read from a file.

You could take a look at the code of the DataFrame.drop method and adapt it so that it modifies your DataFrame in place (which drop already does with inplace=True) and also returns the rows it removed:

import pandas as pd


class DF(pd.DataFrame):
    def drop(self, labels, axis=0, level=None, inplace=False, errors='raise'):
        axis_ = self._get_axis_number(axis)
        axis_name = self._get_axis_name(axis_)
        axis = self._get_axis(axis_)

        # A single label (string or tuple) is treated as a one-element list.
        if isinstance(labels, (str, tuple)) or not hasattr(labels, '__iter__'):
            labels = [labels]
        else:
            labels = list(labels)

        if level is not None and not isinstance(axis, pd.MultiIndex):
            raise AssertionError('axis must be a MultiIndex')

        # Boolean mask marking the entries to remove along the chosen axis.
        if level is not None:
            removed = axis.get_level_values(level).isin(labels)
        else:
            removed = axis.isin(labels)

        if axis.is_unique:
            # Index.drop checks that the labels exist, honouring `errors`.
            if level is not None:
                new_axis = axis.drop(labels, level=level, errors=errors)
            else:
                new_axis = axis.drop(labels, errors=errors)
            result = self.reindex(**{axis_name: new_axis})
        else:
            # Duplicate labels: filter with the mask instead of reindexing.
            slicer = [slice(None)] * self.ndim
            slicer[axis_] = ~removed
            result = self.loc[tuple(slicer)]

        # .ix was removed in pandas 1.0, so the dropped part is selected with .loc,
        # along the same axis that was dropped from.
        slicer = [slice(None)] * self.ndim
        slicer[axis_] = removed
        dropped = self.loc[tuple(slicer)]

        if inplace:
            self._update_inplace(result)
            return dropped
        else:
            return result, dropped

Which will work like this:

df = DF({'one':[1,2,3,4,5,4,3,2,1], 'two':[6,7,8,9,10,9,8,7,6], 'three':[11,12,13,14,15,14,13,12,11]})

dropped = df.drop(range(5), inplace=True)
# or :
# partA, partB = df.drop(range(5))

This example probably isn't very memory efficient, but maybe you can figure out something better using an object-oriented approach like this.
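For comparison, here is a minimal sketch of the same idea using only the public pandas API instead of a subclass. The names split_in_place, test_fraction and seed are made up for illustration, and it assumes the index labels are unique. As far as I know, drop(inplace=True) still builds the reduced frame internally before swapping it in, so this is not guaranteed to keep peak memory flat; it only avoids holding a second full copy once the call returns.

```python
import numpy as np
import pandas as pd


def split_in_place(df, test_fraction=0.2, seed=0):
    """Move a random fraction of rows out of `df` and return them as the test set.

    Hypothetical helper: `df` is modified in place and ends up holding only
    the training rows. Assumes the index labels of `df` are unique.
    """
    rng = np.random.default_rng(seed)
    n_test = int(round(len(df) * test_fraction))
    # Pick the labels of the rows that will become the test set.
    test_labels = rng.choice(df.index.to_numpy(), size=n_test, replace=False)
    # Copy out only the test rows, then remove them from the original frame.
    test = df.loc[test_labels]
    df.drop(index=test_labels, inplace=True)
    return test


df = pd.DataFrame({'one': range(10), 'two': range(10, 20)})
test = split_in_place(df, test_fraction=0.3)
train = df  # same object, now without the test rows
print(len(train), len(test))
```

After the call, train and test share no index labels, and together they cover every row of the original frame.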
