Extracting data belonging to a day from a given range of dates on a dataset

Question

I have a data set with a date range from January 12th to August 3rd of 2018 with some values:

The dimensionality of my_df DataFrame is:

my_df.shape 
(9752, 2)

The data has a half-hour frequency.

The first row is from 2018-01-12:

my_df.iloc[0]
Date:       2018-01-12 00:17:28
Value                      1
Name: 0, dtype: object

And the last row is from 2018-08-03:

my_df.tail(1)
                    Date:  Value
9751  2018-08-03 23:44:59      1

My goal is to select the rows corresponding to each day and export them to a CSV file.

To get only the January 12th data and save it to a readable file, I run:

# Select the rows for a single day
my_df_Jan12 = my_df[(my_df['Date:'] >= '2018-01-12 00:00:00')
                    & (my_df['Date:'] <= '2018-01-12 23:59:59')]
my_df_Jan12.to_csv('Data_Jan_12.csv', sep=',', header=True, index=False)

From January 12 to August 3 there are 203 days (29 weeks).

I don't want to run this query manually for each day, so I started with the following basic analysis:

I need to generate 203 files (one file per day).
The data starts on January 12.
January is the first month (01) and August is the eighth month (08).

Then:

I need to iterate over all 203 days
- and, for each row, check the month and day of its date in order to detect when either of them changes

According to the above, I am trying this approach:

# Selecting data value of each day (203 days)
for i in range(203):
    for j in range(1,9): # month
        for k in range(12,32): # days of the month
            values = my_df[(my_df['Date:']>='2018-0{}-{} 00:00:00'.format(j,k))
                           & (my_df['Date:']<='2018-0{}-{} 23:59:59'.format(j,k))]
            values.to_csv('Values_day_{}.csv'.format(i), sep=',', header=True, index=False)

The problem, I think, is that range(12,32) for the day of the month only makes sense for January, yet the loop applies it to every month.

In the end I get 203 empty CSV files, so I am clearly doing something wrong.

How can I solve this in a suitable way? Any guidance is highly appreciated.

Answer 1:

Something like this? I renamed your original Date: column to Timestamp. I am also assuming that your Date: column is a pandas datetime Series.

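# Rename the columns, then derive a calendar-date column from the timestamps
my_df.columns = ['Timestamp', 'Value']
my_df['Date'] = my_df['Timestamp'].apply(lambda x: x.date())
# Write one CSV per distinct date, named like 2018-01-12.csv
dates = my_df['Date'].unique()
for date in dates:
    f_name = str(date) + '.csv'
    my_df[my_df['Date'] == date].to_csv(f_name)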
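
The loop above relies on the timestamp column holding real datetimes. If the column was read in as strings, it can be converted first. Below is a minimal sketch on made-up data, assuming the column is named `Date:` as in the question. The sample rows, the temporary `out_dir` directory and the `written` list are illustrative and not part of the original post.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample standing in for my_df; timestamps arrive as strings.
my_df = pd.DataFrame({
    "Date:": [
        "2018-01-12 00:17:28",
        "2018-01-12 00:47:31",
        "2018-01-13 00:17:02",
        "2018-08-03 23:44:59",
    ],
    "Value": [1, 2, 3, 1],
})

# Convert the strings to datetime64 so that calendar dates can be extracted.
my_df["Date:"] = pd.to_datetime(my_df["Date:"])

# Illustrative output location; any writable directory would do.
out_dir = Path(tempfile.mkdtemp())

# Group rows by calendar date and write one CSV per date.
written = []
for day, rows in my_df.groupby(my_df["Date:"].dt.date):
    path = out_dir / f"{day}.csv"
    rows.to_csv(path, index=False)
    written.append(path.name)

print(written)
```

Grouping on `.dt.date` only produces groups for dates that actually occur in the data, so no file is written for days without rows.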

Answer 2:

`groupby`

for date, d in df.groupby(pd.Grouper(key='Date', freq='D')):
  d.to_csv(f"Data_{date:%b_%d}.csv", index=False)

Note that I used an f-string, which requires Python 3.6+.
Otherwise, use this:

for date, d in df.groupby(pd.Grouper(key='Date', freq='D')):
  d.to_csv("Data_{:%b_%d}.csv".format(date), index=False)

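Consider this df:

import pandas as pd

# '12h' rather than '12H': the uppercase hour alias is deprecated in recent pandas
df = pd.DataFrame(dict(
    Date=pd.date_range('2010-01-01', periods=10, freq='12h'),
    Value=range(10)
))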

Then

for date, d in df.groupby(pd.Grouper(key='Date', freq='D')):
  d.to_csv(f"Data_{date:%b_%d}.csv", index=False)

And verify

from pathlib import Path

print(*map(Path.read_text, Path('.').glob('Data*.csv')), sep='\n')

Date,Value
2010-01-05 00:00:00,8
2010-01-05 12:00:00,9

Date,Value
2010-01-04 00:00:00,6
2010-01-04 12:00:00,7

Date,Value
2010-01-02 00:00:00,2
2010-01-02 12:00:00,3

Date,Value
2010-01-01 00:00:00,0
2010-01-01 12:00:00,1

Date,Value
2010-01-03 00:00:00,4
2010-01-03 12:00:00,5
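
A daily `pd.Grouper` bins by calendar day, so a date with no rows inside the covered range may come back as an empty group and produce a header-only file. Here is a hedged sketch that skips such groups, using made-up data with a gap on January 14th. The `Date:` column name follows the question, while `out_dir`, `files` and the sample values are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data: two rows on Jan 12, one on Jan 13, none on Jan 14, one on Jan 15.
df = pd.DataFrame({
    "Date:": pd.to_datetime([
        "2018-01-12 00:17:28",
        "2018-01-12 23:44:59",
        "2018-01-13 12:00:00",
        "2018-01-15 08:30:00",
    ]),
    "Value": [1, 2, 3, 4],
})

out_dir = Path(tempfile.mkdtemp())  # illustrative output directory

files = []
for date, d in df.groupby(pd.Grouper(key="Date:", freq="D")):
    if d.empty:
        # Skip days without any rows instead of writing a header-only CSV.
        continue
    path = out_dir / f"Data_{date:%b_%d}.csv"
    d.to_csv(path, index=False)
    files.append(path.name)

print(files)
```

Because the file name pattern omits the year, this naming is only safe while the data stays within a single year, as it does in the question.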

Source: https://stackoverflow.com/questions/52265151/extracting-data-belonging-to-a-day-from-a-given-range-of-dates-on-a-dataset

Tags

python

pandas

extract