I have a large DataFrame, which I would like to split into a test set and a train set for model building. However, I do not want to duplicate the DataFrame because I am reaching
The other answers focus on reading the file, but there is also something you can do if, for whatever reason, your DataFrame isn't read from a file.
You could take the code of the DataFrame.drop method and modify it so that it changes your DataFrame in place (which drop can already do with inplace=True) and also returns the dropped rows:

    import pandas as pd
    import pandas.core.common as com  # private pandas module, may change between versions

    class DF(pd.DataFrame):
        # Adapted from pandas' own DataFrame.drop; relies on private helpers.
        def drop(self, labels, axis=0, level=None, inplace=False, errors='raise'):
            axis = self._get_axis_number(axis)
            axis_name = self._get_axis_name(axis)
            axis, axis_ = self._get_axis(axis), axis

            if axis.is_unique:
                if level is not None:
                    if not isinstance(axis, pd.MultiIndex):
                        raise AssertionError('axis must be a MultiIndex')
                    new_axis = axis.drop(labels, level=level, errors=errors)
                else:
                    new_axis = axis.drop(labels, errors=errors)
                # Despite its name, `dropped` holds the rows that are kept.
                dropped = self.reindex(**{axis_name: new_axis})
                try:
                    dropped.axes[axis_].set_names(axis.names, inplace=True)
                except AttributeError:
                    pass
                result = dropped
            else:
                # Private helper; newer pandas releases appear to name it
                # index_labels_to_array (no leading underscore).
                labels = com._index_labels_to_array(labels)
                if level is not None:
                    if not isinstance(axis, pd.MultiIndex):
                        raise AssertionError('axis must be a MultiIndex')
                    indexer = ~axis.get_level_values(level).isin(labels)
                else:
                    indexer = ~axis.isin(labels)
                slicer = [slice(None)] * self.ndim
                slicer[self._get_axis_number(axis_name)] = indexer
                # .loc replaces .ix, which newer pandas versions no longer provide.
                result = self.loc[tuple(slicer)]

            if inplace:
                # Grab the removed rows before shrinking the frame in place.
                removed = self.loc[labels]
                self._update_inplace(result)
                return removed
            else:
                # (kept rows, removed rows)
                return result, self.loc[labels]

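It works like this:

    df = DF({'one': [1, 2, 3, 4, 5, 4, 3, 2, 1],
             'two': [6, 7, 8, 9, 10, 9, 8, 7, 6],
             'three': [11, 12, 13, 14, 15, 14, 13, 12, 11]})

    # df keeps the remaining rows; dropped receives the rows labelled 0 to 4
    dropped = df.drop(list(range(5)), inplace=True)

    # or, without modifying df (kept rows first, dropped rows second):
    # partA, partB = df.drop(list(range(5)))
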
This example is probably not very memory efficient, but maybe you can work out something better using some kind of object-oriented solution like this.
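For comparison, the same "remove the rows and hand them back" idea can be sketched with public pandas calls only, without subclassing or copying internals. This is a minimal sketch, not a tested memory optimisation: the helper name split_off_rows is made up for illustration, the sample size of 3 is arbitrary, and inplace=True does not by itself guarantee a lower peak memory footprint, since pandas may still build the reduced frame internally before swapping it in.

```python
import pandas as pd


def split_off_rows(df, labels):
    """Remove the rows named by `labels` from `df` in place and return them.

    Hypothetical helper for illustration; assumes `labels` exist in df.index.
    """
    removed = df.loc[labels]               # new frame holding only the selected rows
    df.drop(labels, axis=0, inplace=True)  # shrink the original frame in place
    return removed


df = pd.DataFrame({
    'one': [1, 2, 3, 4, 5, 4, 3, 2, 1],
    'two': [6, 7, 8, 9, 10, 9, 8, 7, 6],
    'three': [11, 12, 13, 14, 15, 14, 13, 12, 11],
})

# Pick three row labels at random for the test set (fixed seed for repeatability).
test_labels = df.sample(n=3, random_state=0).index
test = split_off_rows(df, test_labels)
train = df  # same object as before, now holding only the remaining rows
```

After this runs, test holds the three sampled rows and train (the original df object) holds the other six, so no second full-size copy of the frame is kept around by your own code.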