Question
I am following this blog to identify seasonal customers in my time series data: https://www.kristenkehrer.com/seasonality-code
My code, shown below, is shamelessly close to the blog's, with a few small tweaks. It runs to completion for 2000 customers, but after a few hours, 0 customers were flagged as seasonal in my results.
Looking manually at customers' data over time, I believe I have many examples of seasonal customers that should have been picked up. Below is a sample of the data I am using.
Am I missing something stupid? Am I in way over my head to even try this, being very new to Python?
Note that my data source already adds the zero-usage months (the "0 months"), but I don't think it hurts for the fill-in function to check again. I've also left out the data source credentials step.
Thank you
import pandas as pa
import numpy as np
import pyodbc as py
cnxn = py.connect('DRIVER='+driver+';SERVER='+server+';PORT=1433;DATABASE='+database+';UID='+username+';PWD='+ password)
original = pa.read_sql_query('SELECT s.customer_id, s.yr, s.mnth, Case when s.usage<0 then 0 else s.usage end as usage FROM dbo.Seasonal s Join ( Select Top 2000 customer_id, SUM(usage) as usage From dbo.Seasonal where Yr!=2018 Group by customer_id ) t ON s.customer_id = t.customer_id Where yr!= 2018 Order by customer_id, yr, mnth', cnxn)
grouped = original.groupby(by='customer_id')
def yearmonth_to_justmonth(year, month):
    return year * 12 + month - 1
def fillInForOwner(group):
    # First and last observed rows for this customer (data is sorted by yr, mnth).
    min = group.head(1).iloc[0]
    max = group.tail(1).iloc[0]
    minMonths = yearmonth_to_justmonth(min.yr, min.mnth)
    maxMonths = yearmonth_to_justmonth(max.yr, max.mnth)
    # Note: np.arange excludes the stop value, so the last observed month is not in filled_index.
    filled_index = pa.Index(np.arange(minMonths, maxMonths, 1), name="filled_months")
    group['months'] = group.yr * 12 + group.mnth - 1
    group = group.set_index('months')
    group = group.reindex(filled_index)
    group.customer_id = min.customer_id
    group.yr = group.index // 12
    group.mnth = group.index % 12 + 1
    # Months that were missing from the source get a usage of 0.
    group.usage = np.where(group.usage.isnull(), 0, group.usage).astype(int)
    return group
filledIn = grouped.apply(fillInForOwner)
newIndex = pa.Index(np.arange(filledIn.customer_id.count()))
import rpy2 as r
from rpy2.robjects.packages import importr
from rpy2.robjects import r, pandas2ri, globalenv
pandas2ri.activate()
base = importr('base')
colorspace = importr('colorspace')
forecast = importr('forecast')
times = importr('timeSeries')
stats = importr('stats')
outfile = 'results.csv'
df_list = []
for customerid, dataForCustomer in filledIn.groupby(by=['customer_id']):
    startYear = dataForCustomer.head(1).iloc[0].yr
    startMonth = dataForCustomer.head(1).iloc[0].mnth
    endYear = dataForCustomer.tail(1).iloc[0].yr
    endMonth = dataForCustomer.tail(1).iloc[0].mnth
    customerTS = stats.ts(dataForCustomer.usage.astype(int),
                          start=base.c(startYear, startMonth),
                          end=base.c(endYear, endMonth),
                          frequency=12)
    r.assign('customerTS', customerTS)
    try:
        seasonal = r('''
        fit<-tbats(customerTS, seasonal.periods = 12,
                   use.parallel = TRUE)
        fit$seasonal
        ''')
    except Exception:
        # Fallback when the R call fails; 1 is recorded in place of a result.
        seasonal = 1
    df_list.append({'customer_id': customerid, 'seasonal': seasonal})
    print(f' {customerid} | {seasonal} ')
seasonal_output = pa.DataFrame(df_list)
print(seasonal_output)
seasonal_output.to_csv(outfile)
Answer 1:
Kristen here (that's my code). A result of 1 actually means that the customer is not seasonal (or the model couldn't pick it up), and NULL also means not seasonal. If a customer has a seasonal usage pattern (a period of 12 months, which is what the code is looking for), it'll output [12].
You can always confirm by inspecting a graph of a single customer's behavior and then putting that customer through the algorithm. I also like to cross-check with a seasonal decomposition algorithm in either Python or R.
Here is some R code for looking at the decomposition of your time series. If there is no seasonal window in the plot, your results are not seasonal:
library(forecast)
myts<-ts(mydata$SENDS, start=c(2013,1),end=c(2018,2),frequency = 12)
plot(decompose(myts))
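For a quick numeric cross-check on the Python side, one rough option is the autocorrelation at lag 12. This is a simpler stand-in, not the decomposition above: a series with a yearly pattern should correlate strongly with itself shifted by 12 months. The function name and the synthetic series are illustrative assumptions, and a strong trend can also inflate this number, so it is only a sanity check.

```python
import math
import random


def lag_autocorrelation(values, lag=12):
    """Sample autocorrelation of a monthly series at the given lag."""
    n = len(values)
    if n <= lag:
        raise ValueError("need more observations than the lag")
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        return 0.0  # constant series, e.g. all zeros
    num = sum((values[i] - mean) * (values[i + lag] - mean)
              for i in range(n - lag))
    return num / denom


# Synthetic data for illustration only: four years of monthly usage.
yearly_cycle = [100 + 50 * math.sin(2 * math.pi * m / 12) for m in range(48)]
rng = random.Random(0)
noise = [100 + 50 * rng.random() for _ in range(48)]

seasonal_r = lag_autocorrelation(yearly_cycle)
noise_r = lag_autocorrelation(noise)
print(round(seasonal_r, 2))  # 0.75
print(abs(noise_r) < seasonal_r)
```

A customer whose lag-12 value sits near zero but who is expected to be seasonal is a good candidate for plotting by hand.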
Also, you mentioned (in your Twitter conversation) having problems with some of the 0s not filling in. I haven't had this problem, but my customers have varying lengths of tenure, from 2 to 13 years. I'm not sure what the problem is here.
Let me know if I can help with anything else :)
Answer 2:
Circling back to answer: I got this to work by passing the "original" dataframe straight into the for loop. My data already had the empty $0 months, so I didn't need the fill-in part of the code to run. Thank you all for your help.
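Before skipping the fill-in step like this, it may be worth confirming that each customer's months really are contiguous. The sketch below is one way to check, using the same year * 12 + month - 1 mapping as `yearmonth_to_justmonth` in the question. The helper names and the sample rows are hypothetical.

```python
def month_index(year, month):
    # Same mapping as yearmonth_to_justmonth in the question.
    return year * 12 + month - 1


def missing_months(year_months):
    """Return the (year, month) pairs absent between the first and last observation."""
    present = {month_index(y, m) for y, m in year_months}
    if not present:
        return []
    gaps = []
    for idx in range(min(present), max(present) + 1):
        if idx not in present:
            gaps.append((idx // 12, idx % 12 + 1))
    return gaps


# Illustrative rows for one customer: February 2017 is absent.
rows = [(2016, 11), (2016, 12), (2017, 1), (2017, 3)]
print(missing_months(rows))  # [(2017, 2)]
```

An empty list for every customer suggests the source data already carries the zero months, so the dataframe can go into the loop as-is.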
Source: https://stackoverflow.com/questions/52954983/r-tbats-model-seasonal-customer-flag-no-results