I have a file with 4 columns and thousands of rows. I want to remove the rows whose first-column values fall within certain ranges.
If you want to keep using numpy, the solution isn't hard.
data = data[np.logical_not(np.logical_and(data[:,0] > 20, data[:,0] < 25))]
data = data[np.logical_not(np.logical_and(data[:,0] > 30, data[:,0] < 35))]
Or if you want to combine it all into one statement,
data = data[
np.logical_not(np.logical_or(
np.logical_and(data[:,0] > 20, data[:,0] < 25),
np.logical_and(data[:,0] > 30, data[:,0] < 35)
))
]
To explain: a conditional expression like data[:,0] < 25 creates a boolean array that records, element by element, where the condition holds. In this case, it tells you where the first column of data is less than 25.
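As a small illustration (using a made-up two-column array rather than the data from the actual file):

```python
import numpy as np

# Hypothetical sample data; the original file has 4 columns,
# but 2 are enough to show the idea.
data = np.array([[10, 1],
                 [22, 2],
                 [40, 3]])

# Element-wise comparison against the first column
mask = data[:, 0] < 25
print(mask)  # [ True  True False]
```

Each entry of mask answers "is this row's first element less than 25?" for the corresponding row.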
You can also index numpy arrays with these boolean arrays. A statement like data[data[:,0] > 30] extracts all the rows where data[:,0] > 30 is true, i.e. all the rows whose first element is greater than 30. This kind of boolean indexing is how you extract the rows (or columns, or elements) that you want.
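Continuing with a made-up sample array, boolean indexing looks like this:

```python
import numpy as np

data = np.array([[10, 1],
                 [35, 2],
                 [40, 3]])

# Keep only the rows whose first element is greater than 30
over_30 = data[data[:, 0] > 30]
print(over_30)
# [[35  2]
#  [40  3]]
```

The boolean mask and the array must have compatible shapes along the indexed axis; here the mask has one entry per row, so whole rows are selected.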
Finally, we need logical tools to combine boolean arrays element by element. Python's regular and, or, and not don't work here because they try to evaluate the boolean arrays as a whole, which raises an "ambiguous truth value" error. Fortunately, numpy provides element-wise versions in the form of np.logical_and, np.logical_or, and np.logical_not. With these, we can combine our boolean arrays element-wise to find rows that satisfy more complicated conditions.
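Putting it all together on a hypothetical 4-column array (the column values past the first are placeholders):

```python
import numpy as np

data = np.array([[18, 0, 0, 0],
                 [22, 0, 0, 0],
                 [28, 0, 0, 0],
                 [33, 0, 0, 0]])

# True for rows whose first column lies in (20, 25) or (30, 35)
in_range = np.logical_or(
    np.logical_and(data[:, 0] > 20, data[:, 0] < 25),
    np.logical_and(data[:, 0] > 30, data[:, 0] < 35),
)

# Keep everything *outside* those ranges
kept = data[np.logical_not(in_range)]
print(kept[:, 0])  # [18 28]
```

As an aside, the operators &, |, and ~ are element-wise equivalents of these functions for boolean arrays; just note that comparisons must then be parenthesized, e.g. (data[:, 0] > 20) & (data[:, 0] < 25), because these operators bind more tightly than comparisons.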