Select every nth row as a Pandas DataFrame without reading the entire file

Submitted by 微笑、不失礼 on 2020-06-27 08:52:21

Question


I am reading a large file that contains ~9.5 million rows x 16 cols.

I am interested in retrieving a representative sample, and since the data is organized by time, I want to do this by selecting every 500th element.

I am able to load the data, and then select every 500th row.
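For reference, that load-then-filter step looks roughly like this (the file name is a placeholder):

import pandas as pd

df = pd.read_csv('my_file.csv')   # reads all ~9.5M rows into memory
sample = df.iloc[::500]           # then keep every 500th row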

My question: Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read the whole file first and then filter my data?

Question 2: How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.

Here is a snippet of what the data looks like (first five rows). The first 4 rows are out of order, but the remaining dataset looks ordered (by time):

VendorID    tpep_pickup_datetime    tpep_dropoff_datetime   passenger_count trip_distance   RatecodeID  store_and_fwd_flag  PULocationID    DOLocationID    payment_type    fare_amount extra   mta_tax tip_amount  tolls_amount    improvement_surcharge   total_amount
0   1   2017-01-09 11:13:28 2017-01-09 11:25:45 1   3.30    1   N   263 161 1   12.5    0.0 0.5 2.00    0.00    0.3 15.30
1   1   2017-01-09 11:32:27 2017-01-09 11:36:01 1   0.90    1   N   186 234 1   5.0 0.0 0.5 1.45    0.00    0.3 7.25
2   1   2017-01-09 11:38:20 2017-01-09 11:42:05 1   1.10    1   N   164 161 1   5.5 0.0 0.5 1.00    0.00    0.3 7.30
3   1   2017-01-09 11:52:13 2017-01-09 11:57:36 1   1.10    1   N   236 75  1   6.0 0.0 0.5 1.70    0.00    0.3 8.50
4   2   2017-01-01 00:00:00 2017-01-01 00:00:00 1   0.02    2   N   249 234 2   52.0    0.0 0.5 0.00    0.00    0.3 52.80

Answer 1:


Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read the whole file first and then filter my data?

Something you could do is use the skiprows parameter of read_csv, which accepts a list-like argument of row indices to discard (and thus, by omission, to select). So you could create an np.arange with a length equal to the number of rows to read, and remove every 500th element from it using np.delete; this way only every 500th row is actually read:

import numpy as np
import pandas as pd

n_rows = 9_500_000  # approximate number of data rows in the file
# skip everything except line 0 (the header) and every 500th line after it
skip = np.delete(np.arange(n_rows), np.arange(0, n_rows, 500))
df = pd.read_csv('my_file.csv', skiprows=skip)
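
A variant worth noting: skiprows also accepts a callable that is evaluated against each row index, which avoids building a ~9.5-million-element array up front (the file name below is again a placeholder):

import pandas as pd

# keep the header (line 0) and every 500th line after it
df = pd.read_csv('my_file.csv', skiprows=lambda i: i % 500 != 0)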



Answer 2:


Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read the whole file first and then filter my data?

First get the length of the file with a custom function, remove every 500th row with numpy.setdiff1d, and pass the result to the skiprows parameter of read_csv:

import numpy as np
import pandas as pd

# https://stackoverflow.com/q/845058
def file_len(fname):
    """Count the lines in a file (including the header line)."""
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')
print(len_of_file)

# skip every file line except line 0 (the header) and every 500th line
skipped = np.setdiff1d(np.arange(len_of_file), np.arange(0, len_of_file, 500))
print(skipped)

df = pd.read_csv('test.csv', skiprows=skipped)
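
As a quick sanity check (a sketch reusing len_of_file and df from above), the result should contain one row per 500 file lines, minus the header:

# every 500th file line is kept; subtract 1 for the header at line 0
expected = len(np.arange(0, len_of_file, 500)) - 1
print(len(df), expected)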

How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.

The idea is to read only the datetime column via the usecols parameter, then sort it, select every 500th index value in sorted order, take the set difference, and pass the result to the skiprows parameter again:

# reusing file_len from above
len_of_file = file_len('test.csv')
n_data_rows = len_of_file - 1  # exclude the header line

# read only the datetime column to keep memory usage low
df1 = pd.read_csv('test.csv',
                  usecols=['tpep_pickup_datetime'],
                  parse_dates=['tpep_pickup_datetime'])

# sort by date and keep the original index of every 500th row
sorted_idx = (df1['tpep_pickup_datetime'].sort_values()
                 .iloc[np.arange(0, n_data_rows, 500)].index)

# data row i sits on file line i + 1 (line 0 is the header), so shift the
# kept indices by one; the header line itself is never skipped
skipped = np.setdiff1d(np.arange(1, len_of_file), sorted_idx + 1)
print(skipped)

df = (pd.read_csv('test.csv', skiprows=skipped)
        .sort_values(by=['tpep_pickup_datetime']))
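
Since the data is described as prone to errors, it may be worth checking first whether the datetime column is actually ordered; if it is, the simpler approach from Answer 1 suffices. A minimal check, reusing the single-column frame df1 from above:

# True only if pickup datetimes never decrease from one row to the next
print(df1['tpep_pickup_datetime'].is_monotonic_increasing)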


Source: https://stackoverflow.com/questions/53812094/select-every-nth-row-as-a-pandas-dataframe-without-reading-the-entire-file
