How can I transform descriptors to be mean-centered and scaled to unit variance prior to machine learning modeling using Python and pandas?

删除回忆录丶 posted on 2019-12-12 01:37:16

Question


How can I transform the data set below so that it is mean-centered and scaled to unit variance, using pandas, NumPy, or any other appropriate Python module? The data also contains some missing values, marked as "Nan", which should be removed prior to the modelling task. Please help.

thanks

Example data set:

GA_ID   PN_ID   PC_ID   MBP_ID  GR_ID   AP_ID   class
0.033   6.652   6.681   0.194   0.874   3.177     0
0.034   9.039   6.224   0.194   1.137   Nan       0
0.035   10.936  10.304  1.015   0.911   4.9       1
0.022   10.11   9.603   1.374   0.848   4.566     1
0.035   2.963   17.156  0.599   0.823   9.406     1
0.033   10.872  10.244  1.015   0.574   4.871     1
0.035   21.694  22.389  1.015   0.859   9.259     1
0.035   10.936  10.304  1.015   0.911   Nan       1
0.035   10.936  10.304  1.015   0.911   4.9       1
0.035   10.936  10.304  1.015   0.911   4.9       0
0.036   1.373   12.034  0.35    0.259   5.723     0
0.033   9.831   9.338   0.35    0.919   4.44      0

I have used:

from sklearn import preprocessing
import numpy as np
raw_data = open("/home/zebrafish/Desktop/scklearn/data.csv")
dataset = np.loadtxt(raw_data, delimiter=",")
X = dataset[:, 0:6]  # all six descriptor columns (0:5 would leave out AP_ID)
y = dataset[:, 6]    # class label
X_pro = preprocessing.scale(X)

but I am not sure whether this method is correct, or whether it would ignore the "Nan" values or automatically take appropriate steps for them. The original data had no "Nan" values; I inserted "Nan" manually at two positions in order to understand how to handle them if they occur.

thanks


Question Update

After some googling and playing around with the data, I found that this method may be normalizing the data on a row basis, whereas I want to normalize it on a column basis.

So what would be the appropriate method for column-wise normalization?

thanks


Answer 1:


As you have already started, an easy way to accomplish this is via the preprocessing module of sklearn.

You can start by handling the NaN values:

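import numpy as np
from sklearn.impute import SimpleImputer
# Imputer was removed from scikit-learn; SimpleImputer replaces it and always imputes per column.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
cleaned_X = imp.fit_transform(X)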

In this scenario, each 'Nan' value is replaced by the mean of the remaining values in its column (AP_ID), as opposed to dropping the rows completely (and losing data).
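If you would rather drop the incomplete rows, as the question originally asked, pandas can do that directly. This is only a minimal sketch: the `raw` string below is a hypothetical stand-in for the real file, holding a few rows of the example table, and it assumes the missing values appear as the literal text "Nan".

```python
import io
import pandas as pd

# Hypothetical stand-in for the data file: a few rows of the example table.
raw = """GA_ID PN_ID PC_ID MBP_ID GR_ID AP_ID class
0.033 6.652 6.681 0.194 0.874 3.177 0
0.034 9.039 6.224 0.194 1.137 Nan 0
0.035 10.936 10.304 1.015 0.911 4.9 1
0.022 10.11 9.603 1.374 0.848 4.566 1
0.035 10.936 10.304 1.015 0.911 Nan 1
"""

# "Nan" is not one of pandas' default missing-value markers, so name it explicitly.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", na_values=["Nan"])

# Drop every row that has at least one missing value.
df_clean = df.dropna()

# Split into the feature matrix and the class labels.
X = df_clean.drop(columns="class").to_numpy()
y = df_clean["class"].to_numpy()
```

Dropping rows is simpler than imputing, but with a data set this small every discarded row is a noticeable loss, which is why imputation is suggested above.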

Next, in order to normalize your data on a column basis, your method is actually correct:

scaled_X = preprocessing.scale(cleaned_X)

By default, sklearn will normalize your variables by feature (column); to normalize by sample (row) instead, you can pass 'axis=1' to the scale function. However, I doubt you would ever want to do that.
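One way to convince yourself that the scaling is column-wise is to check the per-column mean and standard deviation of the result. The sketch below uses four complete rows from the example data; the plain-pandas version at the end is an equivalent formulation added for illustration, not part of the original method.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Four complete rows from the example data (features only).
X = np.array([
    [0.033, 6.652, 6.681, 0.194, 0.874, 3.177],
    [0.035, 10.936, 10.304, 1.015, 0.911, 4.9],
    [0.022, 10.11, 9.603, 1.374, 0.848, 4.566],
    [0.035, 2.963, 17.156, 0.599, 0.823, 9.406],
])

# axis=0 is the default: one mean and one standard deviation per column.
scaled_X = preprocessing.scale(X)

col_means = scaled_X.mean(axis=0)  # approximately 0 for every column
col_stds = scaled_X.std(axis=0)    # approximately 1 for every column

# The same transformation in plain pandas; ddof=0 matches the
# population standard deviation that sklearn uses.
df = pd.DataFrame(X)
scaled_df = (df - df.mean()) / df.std(ddof=0)
```

If the scaling had been row-wise, the column means would not come out at zero.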

For reference: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html

One point worth noting: if your later statistical analysis (say, linear regression) assumes no significant correlations across features, and you notice that your features are strongly correlated, then scaling each column independently (which is what preprocessing.scale does) will not be sufficient.
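To see whether the features are correlated, you can inspect the pairwise correlation matrix. A rough sketch, using the six complete and distinct rows from the example data; the 0.8 cut-off is an arbitrary assumption for illustration, not a recommended value.

```python
import pandas as pd

cols = ["GA_ID", "PN_ID", "PC_ID", "MBP_ID", "GR_ID", "AP_ID"]

# Six complete rows from the example data.
df = pd.DataFrame([
    [0.033, 6.652, 6.681, 0.194, 0.874, 3.177],
    [0.035, 10.936, 10.304, 1.015, 0.911, 4.9],
    [0.022, 10.11, 9.603, 1.374, 0.848, 4.566],
    [0.035, 2.963, 17.156, 0.599, 0.823, 9.406],
    [0.033, 10.872, 10.244, 1.015, 0.574, 4.871],
    [0.035, 21.694, 22.389, 1.015, 0.859, 9.259],
], columns=cols)

# Pearson correlation between every pair of feature columns.
corr = df.corr()

# Collect the feature pairs whose absolute correlation exceeds the cut-off.
threshold = 0.8  # hypothetical cut-off, chosen only for this sketch
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
```

Scaling does not change these correlations, so the check gives the same answer before or after preprocessing.scale.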

If that is indeed the case, I would suggest first using sklearn's PCA decomposition with 'whiten=True'. This scales the data to zero mean and unit variance while removing linear correlations across features (by projecting onto orthogonal directions that explain most of the variability in your data).
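A minimal sketch of the whitening step, again using six complete rows from the example data. Keeping two components is an arbitrary choice for illustration; PCA needs complete data, so missing values have to be imputed or dropped first.

```python
import numpy as np
from sklearn.decomposition import PCA

# Six complete rows from the example data (features only).
X = np.array([
    [0.033, 6.652, 6.681, 0.194, 0.874, 3.177],
    [0.035, 10.936, 10.304, 1.015, 0.911, 4.9],
    [0.022, 10.11, 9.603, 1.374, 0.848, 4.566],
    [0.035, 2.963, 17.156, 0.599, 0.823, 9.406],
    [0.033, 10.872, 10.244, 1.015, 0.574, 4.871],
    [0.035, 21.694, 22.389, 1.015, 0.859, 9.259],
])

# n_components=2 is an assumption made for this sketch only.
pca = PCA(n_components=2, whiten=True)
whitened_X = pca.fit_transform(X)

# The whitened components are centered, uncorrelated, and have unit
# variance, so their sample covariance is close to the identity matrix.
cov = np.cov(whitened_X, rowvar=False)
```

Note that the output columns are principal components rather than the original descriptors, so they can no longer be read as GA_ID, PN_ID, and so on.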

For reference: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA

Hope this helps!



Source: https://stackoverflow.com/questions/29249749/how-can-i-transformed-descriptors-mean-centered-and-scaled-to-unit-variance-prio
