问题
I'm getting acquainted with Statsmodels so as to shift my more complicated stats completely over to python. However, I'm being cautious, so I'm cross-checking my results with SPSS, just to make sure I'm not making any obvious blunders. Most of time, there's no difference, but I have one example of a two-way ANOVA that's throwing up very different test statistics in Statsmodels and SPSS. (Relevant point: the sample sizes in the ANOVA are mismatched, so ANOVA may not be the appropriate model here.)
I'm selecting my model as follows:
import pandas as pd
import scipy as sp
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
Body = pd.read_csv(filepath)
Body = Body.dropna()
Body_lm = ols('Effect ~ C(Fiction) + C(Condition) + C(Fiction)*C(Condition)', data = Body).fit()
table = sm.stats.anova_lm(Body_lm, typ=2)
The Statsmodels output is as below:
sum_sq df F PR(>F)
C(Fiction) 278.176684 1.0 307.624463 1.682042e-55
C(Condition) 4.294764 1.0 4.749408 2.971278e-02
C(Fiction):C(Condition) 10.776312 1.0 11.917092 5.970123e-04
Residual 520.861599 576.0 NaN NaN
The corresponding SPSS results are these:
Can anyone help explain the difference? Is is perhaps the unequal sample sizes being treated differently under the hood? Or am I choosing the wrong model?
Any help appreciated!
回答1:
You should use sum coding when comparing the means of the variables.
BTW you don't need to specify each variable that are in the interaction term if *
multiply operator is used:
“:” adds a new column to the design matrix with the product of the other two columns.
“*” will also include the individual columns that were multiplied together.
Your model should be:
Body_lm = ols('Effect ~ C(Fiction, Sum)*C(Condition, Sum)', data = Body).fit()
来源:https://stackoverflow.com/questions/42160020/statsmodels-gives-different-anova-results-to-spss