Question
I suspect this problem is straightforward and has an easy solution, but I am quite new to Python, and especially to pandas, so I could not work it out on my own.
I have hundreds of CSV files whose names follow this format:
text_2014-02-22_13-00-00
So the format is str_YYYY-MM-DD_HH-MI-SS, and every file represents an interval of one hour.
I want to create a dataframe from an interval that I set with Start_Time and End_Time. For example, if I set Start_Time to 2014-02-22 21:40:00 and End_Time to 2014-02-22 22:55:00 (this time format is only for illustration), I should get a dataframe containing the data within that interval, which comes from two different files.
So, I believe that this problem might be divided into two parts:
1 - Read just the date out of the file name
2 - Create a dataframe based on the time interval that I set.
I hope I managed to be succinct and precise. I would really appreciate your help on this one! Suggestions of what to look up are also welcome.
Answer 1:
The solution has a few different parts:
- create the path to the folder
- manually create 3 sample CSV files
- save the names of the CSV files in the folder to a list
- write a custom function to parse the filename into a datetime object
- bring it all together by looping through the CSV files in the folder
import os
import pandas as pd
import datetime
# step 1: create the path to the folder
path_cwd = os.getcwd()
# step 2: manually create 3 sample CSV files
# (to_csv returns None when given a path, so the result is not assigned)
pd.DataFrame({'Length': [10, 5, 6],
              'Width': [5, 2, 3],
              'Weight': [100, 120, 110]
              }).to_csv('text_2014-02-22_13-00-00.csv', index=False)
pd.DataFrame({'Length': [11, 7, 8],
              'Width': [4, 1, 2],
              'Weight': [101, 111, 131]
              }).to_csv('text_2014-02-22_14-00-00.csv', index=False)
pd.DataFrame({'Length': [15, 9, 7],
              'Width': [1, 4, 2],
              'Weight': [200, 151, 132]
              }).to_csv('text_2014-02-22_15-00-00.csv', index=False)
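# step 3: save the names of the CSV files in the folder to a list
list_csv = os.listdir(path_cwd)
# keep only names that end with '.csv' (a plain substring test would also match e.g. 'a.csv.bak')
list_csv = [x for x in list_csv if x.endswith('.csv')]
print('here are the CSV files in the folder: ')
print(list_csv)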
# step 4: extract the datetime from filenames
def get_datetime_filename(str_filename):
    '''
    Function to grab the datetime from the filename.
    Example: 'text_2014-02-22_13-00-00.csv'
    '''
    # split the filename by the underscore
    list_split_file = str_filename.split('_')
    # the 2nd part is the date
    str_date = list_split_file[1]
    # the 3rd part is the time, remove the '.csv'
    str_time = list_split_file[2]
    str_time = str_time.split('.')[0]
    # combine the 2nd and 3rd parts
    str_datetime = str_date + ' ' + str_time
    # convert the string to a datetime object
    # https://chrisalbon.com/python/basics/strings_to_datetime/
    # https://stackoverflow.com/questions/10663720/converting-a-time-string-to-seconds-in-python
    dt_datetime = datetime.datetime.strptime(str_datetime, '%Y-%m-%d %H-%M-%S')
    return dt_datetime
# step 5: bring it all together
# collect one dataframe per file, then concatenate once at the end
list_frames = []
# loop through each csv file
for each_csv in list_csv:
    # full path to csv file
    temp_path_csv = os.path.join(path_cwd, each_csv)
    # temporary dataframe
    df_temp = pd.read_csv(temp_path_csv)
    # add a column with the datetime from filename
    df_temp['datetime_source'] = get_datetime_filename(each_csv)
    list_frames.append(df_temp)
# concatenate dataframes (empty dataframe if no csv files were found)
df_master = pd.concat(list_frames) if list_frames else pd.DataFrame()
# reset the dataframe index
df_master = df_master.reset_index(drop=True)
# examine the master dataframe
print(df_master.shape)
# print(df_master.head(10))
df_master.head(10)
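The code above tags every row with the timestamp from its filename, but it does not yet restrict the result to the Start_Time / End_Time interval asked about in the question. Below is a minimal sketch of that step, not a definitive implementation. It assumes that the time in each filename marks the start of that file's one-hour window, and that a file should be loaded whenever its window overlaps the requested interval. The names FILE_SPAN, parse_file_start, files_in_interval and load_interval are made up for this illustration and are not part of pandas.

```python
import datetime
import os

import pandas as pd

# Assumption: each file covers one hour, starting at the time in its name.
FILE_SPAN = datetime.timedelta(hours=1)


def parse_file_start(filename):
    """Return the datetime encoded in a name like 'text_2014-02-22_13-00-00.csv'."""
    stem = filename.rsplit('.', 1)[0]                # drop the extension
    _, str_date, str_time = stem.rsplit('_', 2)      # prefix, date, time
    return datetime.datetime.strptime(str_date + ' ' + str_time,
                                      '%Y-%m-%d %H-%M-%S')


def files_in_interval(filenames, start_time, end_time):
    """Keep the files whose one-hour window overlaps [start_time, end_time]."""
    selected = []
    for name in sorted(filenames, key=parse_file_start):
        file_start = parse_file_start(name)
        file_end = file_start + FILE_SPAN
        # two intervals overlap if each one starts before the other ends
        if file_start <= end_time and file_end > start_time:
            selected.append(name)
    return selected


def load_interval(folder, start_time, end_time):
    """Read and concatenate only the CSV files that overlap the interval."""
    names = [n for n in os.listdir(folder) if n.endswith('.csv')]
    frames = []
    for name in files_in_interval(names, start_time, end_time):
        df = pd.read_csv(os.path.join(folder, name))
        df['datetime_source'] = parse_file_start(name)
        frames.append(df)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# The interval from the question should select exactly two files.
start = datetime.datetime(2014, 2, 22, 21, 40, 0)
end = datetime.datetime(2014, 2, 22, 22, 55, 0)
names = ['text_2014-02-22_%02d-00-00.csv' % hour for hour in range(20, 24)]
print(files_in_interval(names, start, end))
# ['text_2014-02-22_21-00-00.csv', 'text_2014-02-22_22-00-00.csv']
```

With real files, load_interval(os.getcwd(), start, end) would return the combined dataframe. This only narrows the result down to whole files. If the rows carry their own timestamp column (the sample files here do not), the dataframe could be filtered further with a boolean mask on that column.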
Source: https://stackoverflow.com/questions/58401804/create-a-dataframe-of-csv-files-based-on-timestamp-intervals