Question
I need to tackle a problem that far exceeds my current Python skills. I am having difficulty combining different modules (csv reader, numpy etc.) into a single script. My data contains many weather variables recorded at one-minute resolution over many days. My objective is to determine the trend of the wind speed between 9am and 12pm on every day in the list. If the gradient of the wind speed is positive, I wish to write the date on which this occurred to a new csv file, along with what the wind direction was.
The data extends for thousands of rows and looks like this:
hd,Station Number,Year Month Day Hours Minutes in YYYY,MM,DD,HH24,MI format in Local time,Year Month Day Hours Minutes in YYYY,MM,DD,HH24,MI format in Local standard time,Year Month Day Hours Minutes in YYYY,MM,DD,HH24,MI format in Universal coordinated time,Precipitation since last (AWS) observation in mm,Quality of precipitation since last (AWS) observation value,Air Temperature in degrees Celsius,Quality of air temperature,Air temperature (1-minute maximum) in degrees Celsius,Quality of air temperature (1-minute maximum),Air temperature (1-minute minimum) in degrees Celsius,Quality of air temperature (1-minute minimum),Wet bulb temperature in degrees Celsius,Quality of Wet bulb temperature,Wet bulb temperature (1 minute maximum) in degrees Celsius,Quality of wet bulb temperature (1 minute maximum),Wet bulb temperature (1 minute minimum) in degrees Celsius,Quality of wet bulb temperature (1 minute minimum),Dew point temperature in degrees Celsius,Quality of dew point temperature,Dew point temperature (1-minute maximum) in degrees Celsius,Quality of Dew point Temperature (1-minute maximum),Dew point temperature (1 minute minimum) in degrees Celsius,Quality of Dew point Temperature (1 minute minimum),Relative humidity in percentage %,Quality of relative humidity,Relative humidity (1 minute maximum) in percentage %,Quality of relative humidity (1 minute maximum),Relative humidity (1 minute minimum) in percentage %,Quality of Relative humidity (1 minute minimum),Wind (1 minute) speed in km/h,Wind (1 minute) speed quality,Minimum wind speed (over 1 minute) in km/h,Minimum wind speed (over 1 minute) quality,Wind (1 minute) direction in degrees true,Wind (1 minute) direction quality,Standard deviation of wind (1 minute),Standard deviation of wind (1 minute) direction quality,Maximum wind gust (over 1 minute) in km/h,Maximum wind gust (over 1 minute) quality,Visibility (automatic - one minute data) in km,Quality of visibility (automatic - one minute data),Mean sea level pressure in hPa,Quality of mean sea level pressure,Station level pressure in hPa,Quality of station level pressure,QNH pressure in hPa,Quality of QNH pressure,#
hd, 40842,2000,03,20,10,50,2000,03,20,10,50,2000,03,20,00,50, ,N, 25.7,N, 25.7,N, 25.6,N, 21.5,N, 21.5,N, 21.4,N, 19.2,N, 19.2,N, 19.0,N, 67,N, 68,N, 66,N, 13,N, 9,N,100,N, 4,N, 15,N, ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,51,2000,03,20,10,51,2000,03,20,00,51, 0.0,N, 25.6,N, 25.8,N, 25.6,N, 21.5,N, 21.6,N, 21.5,N, 19.2,N, 19.4,N, 19.2,N, 68,N, 68,N, 66,N, 11,N, 9,N,107,N, 11,N, 13,N, ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,52,2000,03,20,10,52,2000,03,20,00,52, 0.0,N, 25.8,N, 25.8,N, 25.6,N, 21.7,N, 21.7,N, 21.5,N, 19.5,N, 19.5,N, 19.2,N, 68,N, 69,N, 66,N, 11,N, 9,N, 83,N, 13,N, 13,N, ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,53,2000,03,20,10,53,2000,03,20,00,53, 0.0,N, 25.8,N, 25.9,N, 25.8,N, 21.6,N, 21.8,N, 21.6,N, 19.3,N, 19.6,N, 19.3,N, 67,N, 68,N, 66,N, 9,N, 8,N, 87,N, 14,N, 11,N, ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,54,2000,03,20,10,54,2000,03,20,00,54, 0.0,N, 25.8,N, 25.8,N, 25.8,N, 21.6,N, 21.6,N, 21.6,N, 19.3,N, 19.3,N, 19.2,N, 67,N, 67,N, 67,N, 8,N, 4,N, 98,N, 23,N, 9,N, ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,55,2000,03,20,10,55,2000,03,20,00,55, 0.0,N, 25.7,N, 25.8,N, 25.7,N, 21.5,N, 21.6,N, 21.5,N, 19.2,N, 19.3,N, 19.2,N, 67,N, 68,N, 66,N, 8,N, 4,N, 68,N, 15,N, 9,N, ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,56,2000,03,20,10,56,2000,03,20,00,56, 0.0,N, 25.9,N, 25.9,N, 25.7,N, 21.7,N, 21.7,N, 21.5,N, 19.4,N, 19.4,N, 19.2,N, 67,N, 68,N, 66,N, 8,N, 5,N, 69,N, 16,N, 9,N, ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,57,2000,03,20,10,57,2000,03,20,00,57, 0.0,N, 26.0,N, 26.0,N, 25.9,N, 21.8,N, 21.8,N, 21.7,N, 19.5,N, 19.5,N, 19.4,N, 67,N, 68,N, 66,N, 9,N, 5,N, 72,N, 10,N, 11,N, ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,58,2000,03,20,10,58,2000,03,20,00,58, 0.0,N, 26.0,N, 26.1,N, 26.0,N, 21.7,N, 21.8,N, 21.7,N, 19.4,N, 19.5,N, 19.3,N, 66,N, 67,N, 66,N, 8,N, 5,N, 69,N, 13,N, 11,N, ,N,1018.6,N,1017.5,N,1018.6,N,#
The completed file, which contains only the dates on which the wind speed increased from 9am to 12pm, will hopefully be of the form below:
date,wind direction,gradient_of_wind_speed,
2000/3/25,108,0.7,
2000/4/17,67,0.4,
...
The exact value of the gradient is not of importance, only whether it is positive, so it would be fine to construct a second array of the form (1,2,3,4,5...) to use as the second dimension of the array for the linear regression. The challenge lies in the fact that many days have missing data, so although the array should have length 180 (i.e. 180 minutes between 9am and 12pm) it will in actuality have a varying length.
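One way to cope with the varying length is to regress against the actual minute offset of each observation rather than a fixed 1..180 index, so gaps do not distort the slope. The following is only a minimal sketch of that idea; the function name `morning_gradient` and the sample values are made up for illustration.

```python
import numpy as np


def morning_gradient(minutes, speeds):
    """Least-squares slope of wind speed against minutes since 09:00.

    Observations with a missing speed (NaN) are dropped, so the inputs may
    hold fewer than 180 points. Returns NaN if fewer than two points remain.
    """
    minutes = np.asarray(minutes, dtype=float)
    speeds = np.asarray(speeds, dtype=float)
    valid = ~np.isnan(speeds)
    if valid.sum() < 2:
        return float("nan")
    slope, _intercept = np.polyfit(minutes[valid], speeds[valid], 1)
    return slope


# Made-up sample: minutes after 09:00, with minute 3 missing from the record.
sample_minutes = [0, 1, 2, 4, 5]
sample_speeds = [8.0, 9.0, 10.0, 12.0, 13.0]
print(morning_gradient(sample_minutes, sample_speeds) > 0)
```

Because only the sign of the slope matters here, the units of the x axis (minutes, hours, row index) do not change the outcome, as long as the x values reflect the real spacing of the observations.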
Is this challenge more easily tackled through multiple scripts (bearing in mind I have to do this for 100+ files) or is there some easy way of tackling this challenge in a single script?
Attempted code:
import glob
import pandas as pd
import numpy as np
for file in glob.glob('X:/brisbaneweatherdata/*.txt'):
    df = pd.read_csv(file)
    for date, group in df.groupby(['Year Month Day Hours Minutes in YYYY','MM','DD']):
        morning_data = group[group.HH24.between('09','12')]
        # calculate your linear regression here
        gradient, intercept = np.polyfit(morning_data.HH24,morning_data['Wind (1 minute) speed in km/h'], 1)
        wind_direction= np.average(morning_data.HH24,morning_data['Wind (1 minute) direction in degrees true'])
        if gradient > 0 :
            print(date + "," + gradient + "," + wind_direction)
Error message received:
runfile('X:/python/linearregression.py', wdir='X:/python')
X:/python/linearregression.py:1: DtypeWarning: Columns (17,25,27,29,31,33,35,37,55,57,59) have mixed types. Specify dtype option on import or set low_memory=False.
  import glob
Traceback (most recent call last):
  File "<ipython-input-26-ace8af14da2c>", line 1, in <module>
    runfile('X:/python/linearregression.py', wdir='X:/python')
  File "C:\Users\kirkj\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)
  File "C:\Users\kirkj\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)
  File "X:/python/linearregression.py", line 8, in <module>
    morning_data = group[group.HH24.between('09','12')]
  File "C:\Users\kirkj\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\series.py", line 2486, in between
    lmask = self >= left
  File "C:\Users\kirkj\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\ops.py", line 761, in wrapper
    res = na_op(values, other)
  File "C:\Users\kirkj\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\ops.py", line 716, in na_op
    raise TypeError("invalid type comparison")
TypeError: invalid type comparison
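The traceback points at the `between('09','12')` call. A likely cause (an assumption, since the parsed dtypes are not shown) is that pandas read the HH24 column as integers, so comparing it with the strings '09' and '12' fails. The small sketch below uses a made-up four-row sample with shortened column names to show the comparison done with integers instead.

```python
import io

import pandas as pd

# Made-up sample; "speed" stands in for the real wind speed column.
sample = io.StringIO(
    "HH24,MI,speed\n"
    "08,59,7\n"
    "09,00,8\n"
    "11,59,9\n"
    "12,00,10\n"
)
df = pd.read_csv(sample)

# "08", "09", ... are parsed as integers, so the bounds must be integers too.
# between() includes both ends: 9..11 keeps 09:00 through 11:59.
morning = df[df["HH24"].between(9, 11)]
print(morning)
```

Using 11 rather than 12 as the upper bound avoids pulling in the whole 12:00 to 12:59 hour, which `between(9, 12)` would include.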
Answer 1:
I think you should be able to do this in a fairly simple script, using glob to iterate through your files and pandas to read in your data. Here is a basic outline of how I would structure it:
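import glob
import numpy as np
import pandas as pd
for file in glob.glob('data/*'):
    df = pd.read_csv(file)
    # 'year', 'month', 'day', 'wind speed' and 'wind direction' are placeholder column names
    for date, group in df.groupby(['year', 'month', 'day']):
        # assumes HH24 was read as a number; 9..11 covers 09:00 through 11:59
        morning_data = group[group.HH24.between(9, 11)].dropna(subset=['wind speed'])
        if len(morning_data) < 2:
            continue  # not enough points to fit a line
        # calculate your linear regression here
        # (x is the hour; use minutes since 09:00 for a finer fit)
        gradient, intercept = np.polyfit(morning_data.HH24, morning_data['wind speed'], 1)
        wind_direction = morning_data['wind direction'].mean()
        if gradient > 0:
            print("{},{},{}".format(date, wind_direction, gradient))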
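Fleshing the outline out against the header shown in the question, a single script could look roughly like the sketch below. The column names are assumptions about how pandas would label that header (repeated names such as MM get a suffix on later occurrences, so the first, local-time set is used here); check `df.columns` on a real file. The function names and the output layout are illustrative, not part of the original answer.

```python
import glob

import numpy as np
import pandas as pd

# Assumed column names, taken from the header in the question.
YEAR = "Year Month Day Hours Minutes in YYYY"
MONTH, DAY, HOUR = "MM", "DD", "HH24"
MINUTE = "MI format in Local time"
SPEED = "Wind (1 minute) speed in km/h"
DIRECTION = "Wind (1 minute) direction in degrees true"
OUT_COLUMNS = ["date", "wind direction", "gradient_of_wind_speed"]


def rising_wind_days(df):
    """Return one row per day whose 09:00-11:59 wind speed trend is positive."""
    df = df.copy()
    # Non-numeric fields (for example blanks) become NaN instead of raising.
    for col in (HOUR, MINUTE, SPEED, DIRECTION):
        df[col] = pd.to_numeric(df[col], errors="coerce")
    morning = df[df[HOUR].between(9, 11)].dropna(subset=[MINUTE, SPEED])
    rows = []
    for (year, month, day), group in morning.groupby([YEAR, MONTH, DAY]):
        if len(group) < 2:
            continue  # a line needs at least two points
        minutes = (group[HOUR] - 9) * 60 + group[MINUTE]
        gradient = np.polyfit(minutes, group[SPEED], 1)[0]
        if gradient > 0:
            rows.append({
                "date": "{}/{}/{}".format(int(year), int(month), int(day)),
                # Plain average; crude when directions straddle 0/360 degrees.
                "wind direction": group[DIRECTION].mean(),
                "gradient_of_wind_speed": gradient,
            })
    return pd.DataFrame(rows, columns=OUT_COLUMNS)


def process_files(pattern, out_path):
    """Apply rising_wind_days to every file matching pattern; write one CSV."""
    frames = [rising_wind_days(pd.read_csv(path, low_memory=False))
              for path in sorted(glob.glob(pattern))]
    if frames:
        result = pd.concat(frames, ignore_index=True)
    else:
        result = pd.DataFrame(columns=OUT_COLUMNS)
    result.to_csv(out_path, index=False)
    return result
```

A call such as `process_files('X:/brisbaneweatherdata/*.txt', 'rising_wind_days.csv')` would then cover all files in one run; the output file name is only an example. Passing `low_memory=False` follows the suggestion in the DtypeWarning quoted in the question.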
Source: https://stackoverflow.com/questions/36612664/reading-csv-to-array-performing-linear-regression-on-array-and-writing-to-csv-i