Statistical regression on multi-dimensional data [closed]

Question

I have a set of data in (x, y, z) format where z is the output of some formula involving x and y. I want to find out what the formula is, and my Internet research suggests that statistical regression is the way to do this.

However, all of the examples I have found while researching only deal with two-dimensional data sets (x, y), which is not useful for my situation. Said examples also don't seem to provide a way to see what the resulting formula is; they just provide a function for predicting future outputs based on data not in a training data set.

The level of precision needed is that the formula for z needs to produce results within +/- 0.5 of actual values.

Can anyone tell me how I can do what I want to do? Please note that I am not asking for specific recommendations on a software library to use.

Answer 1:

If the formula is a linear function, check out this tutorial. It uses ordinary least squares (OLS) to fit your data, which is quite powerful.

Assume that you have data points (x1, y1, z1), (x2, y2, z2), ..., (xn, yn, zn), and transform them into three separate numpy arrays X, Y and Z.

import numpy as np
X = np.array([x1, x2, ..., xn])
Y = np.array([y1, y2, ..., yn])
Z = np.array([z1, z2, ..., zn])

Then, use ols to fit them!

import pandas
from statsmodels.formula.api import ols

# Your data.
# Z = a*X + b*Y + c
data = pandas.DataFrame({'x': X, 'y': Y, 'z': Z})

# Fit your data with ols model.
# The formula refers to the DataFrame column names 'z', 'x' and 'y'.
model = ols("z ~ x + y", data).fit()

# Get your model summary.
print(model.summary())

# Get your model parameters.
print(model.params)
# A pandas Series: Intercept should be approximately c, x approximately a, y approximately b.

If there are more variables

Add as many variables to the DataFrame as you like, and list them in the formula as well, for example "z ~ v1 + v2 + v3 + v4".

# Your data.
data = pandas.DataFrame({'v1': V1, 'v2': V2, 'v3': V3, 'v4': V4, 'z': Z})

Reference

Python package statsmodels

Answer 2:

The most basic tool you need is multiple linear regression. The basic method models z as a linear function of x and y plus Gaussian noise e: f(x,y) = a1*x + a2*y + a3, and z is then produced as f(x,y) + e, where e is usually zero-mean Gaussian with unknown variance. You need to find the coefficients a1 and a2 and the bias a3. These are usually estimated by maximum likelihood, which boils down to ordinary least squares under the Gaussian assumption and has a closed-form analytic solution.
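
As a rough illustration of that least-squares estimate, here is a sketch using only numpy. The data is synthetic (generated from made-up coefficients 2, -3 and 5), and np.linalg.lstsq is used because it solves the same least-squares problem as the normal equations in a numerically more stable way.

```python
import numpy as np

# Synthetic data for illustration only: z = 2*x - 3*y + 5 plus small Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=200)
y = rng.uniform(-10, 10, size=200)
z = 2.0 * x - 3.0 * y + 5.0 + rng.normal(scale=0.1, size=200)

# Design matrix with columns [x, y, 1]; the column of ones carries the bias a3.
A = np.column_stack([x, y, np.ones_like(x)])

# Least-squares solution of A @ [a1, a2, a3] ~ z.
coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
a1, a2, a3 = coeffs
print(f"z = {a1:.3f}*x + {a2:.3f}*y + {a3:.3f}")
```

The printed coefficients are the formula being asked for: they can be read off directly rather than hidden behind a predict function.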

Since you have access to Python, take a look at linear regression in scikit-learn: http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares

Answer 3:

If you can reuse code from an existing Python 3 tkinter GUI application on GitHub, take a look at fitting the linear polynomial surface equation that you mentioned using my tkInterFit project. It will also create fitted surface and contour plots. The GitHub source code is at https://github.com/zunzun/tkInterFit with a BSD license.
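
A minimal sketch of what that could look like with scikit-learn's LinearRegression, assuming the package is installed. The data is synthetic and noise-free, generated from made-up coefficients 1.5, 0.5 and -4, so the fit should recover them almost exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data for illustration only: z = 1.5*x + 0.5*y - 4.
rng = np.random.default_rng(1)
features = rng.uniform(0, 100, size=(50, 2))  # columns are x and y
z = 1.5 * features[:, 0] + 0.5 * features[:, 1] - 4.0

reg = LinearRegression().fit(features, z)

# coef_ holds one coefficient per feature column; intercept_ is the constant term.
a1, a2 = reg.coef_
a3 = reg.intercept_
print(f"z = {a1:.3f}*x + {a2:.3f}*y + {a3:.3f}")
```

The coef_ and intercept_ attributes expose the fitted formula, which addresses the concern in the question about only getting a prediction function.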

Source: https://stackoverflow.com/questions/44984035/statistical-regression-on-multi-dimensional-data

Tags

python

math

statistics

regression