Normalization by applying a Gaussian function twice, to the negative and positive numbers of the data in a dataframe

Question

I'm trying to read a dataset from a text file, extract 3 main parameters (A, B, C) into separate lists, and normalize each list after fitting a Gaussian distribution function to it. To get a good result, I split each parameter's list into positive and negative numbers, apply the Gaussian distribution function to each part separately, and pick:

the mean value of the negative numbers as the real minimum

the mean value of the positive numbers as the real maximum

instead of taking the min and max values directly from the main list of each parameter, since those extremes may occur only a few times and lie outside the desired confidence interval. Everything can be done in 4 steps, as shown in the picture below. It contains 2 plots, each in the form of a 24x20 matrix, for one of the columns of the dataframe; the data is very unclear because of huge regression, scattering, and unwanted noise, which make it hard to interpret:

Since there is no standard answer for normalizing into a given interval such as [a, b], I defined a function based on the formula:

def normalize(value, min_value, max_value, min_norm, max_norm):
    new_value = ((max_norm - min_norm)*((value - min_value)/(max_value - min_value))) + min_norm
    return new_value
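As a quick check of the formula, here is a minimal sketch that rescales a small array using its true minimum and maximum. The sample values are hypothetical and not taken from the dataset.

```python
import numpy as np

def normalize(value, min_value, max_value, min_norm, max_norm):
    # Linear map: min_value -> min_norm, max_value -> max_norm
    return (max_norm - min_norm) * ((value - min_value) / (max_value - min_value)) + min_norm

# Hypothetical sample values, only for illustration
values = np.array([-4.0, -2.0, 1.0, 6.0])

# With the true min and max as bounds, every result stays inside [-1, 1]
scaled = normalize(values, values.min(), values.max(), -1, 1)
print(scaled)
```

With the true extremes as bounds, the smallest value lands on min_norm and the largest on max_norm.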

My script is below:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats  # needed for scipy.stats.norm.interval below
import warnings
warnings.filterwarnings("ignore",category =RuntimeWarning)

dft = pd.read_csv('D:/me.txt', header=None)
id_set = dft[dft.index % 4 == 0].astype('int').values
A = dft[dft.index % 4 == 1].values
B = dft[dft.index % 4 == 2].values
C = dft[dft.index % 4 == 3].values
data = {'A': A[:,0], 'B': B[:,0], 'C': C[:,0]} # arrays
# df contains all the data
df = pd.DataFrame(data, columns=['A','B','C'], index = id_set[:,0]) 
df  = df.replace([np.inf, -np.inf], np.nan).astype(np.float64) 
df  = df.fillna(0.012345)


def normalize(value, min_value, max_value, min_norm, max_norm):
    new_value = ((max_norm - min_norm)*((value - min_value)/(max_value - min_value))) + min_norm
    return new_value

def createpositiveandnegativelist(listtocreate):
    # split the values by sign; zeros go into neither list
    l_negative = []
    l_positive = []
    for value in listtocreate:
        if (value < 0):
            l_negative.append(value)
        elif (value > 0):
            l_positive.append(value)
    #print(l_negative)
    #print(l_positive)
    return l_negative,l_positive

def calculatemean(listtocalculate):
    return sum(listtocalculate)/len(listtocalculate)

def plotgaussianfunction(mu,sigma):
    s = np.random.normal(mu, sigma,1000)
    # sanity checks on the sample; the results are not used
    abs(mu - np.mean(s))<0.01
    abs(sigma - np.std(s,ddof=1))<0.01
    #count, bins, ignored = plt.hist(s,30,density=True)
    #plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins-mu)**2/(2*sigma**2)),linewidth=2, color= 'r')
    #plt.show()
    return


def plotboundedCI(s, mu, sigma, lists):
    plt.figure()

    count, bins, ignored = plt.hist(s,30,density=True)
    plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins-mu)**2/(2*sigma**2)),linewidth=2, color= 'r')
    #confidence interval calculation
    ci = scipy.stats.norm.interval(0.68, loc = mu, scale = sigma)
    #confidence interval for left line
    one_x12, one_y12 = [ci[0],ci[0]], [0,3]
    #confidence interval for right line
    two_x12, two_y12 = [ci[1],ci[1]], [0,3]

    plt.title("Gaussian 68% Confidence Interval", fontsize=12, color='black', loc='left', style='italic')
    plt.plot(one_x12, one_y12, two_x12, two_y12, marker = 'o')
    plt.show()

    #make sure to exclude the outliers of the CI
    results = []
    for value in lists:
        if(ci[0]< value <ci[1]):
            results.append(value)
        else:
            #print("NOT WANTED: ",value)
            pass

    return results



c_negative, c_positive = createpositiveandnegativelist(C)
b_negative, b_positive = createpositiveandnegativelist(B)
a_negative, a_positive = createpositiveandnegativelist(A)

#get max and min values
a_min = df['A'].min()
a_max = df['A'].max()
b_min = df['B'].min()
b_max = df['B'].max()
c_min = df['C'].min()
c_max = df['C'].max()

print ("\ntmp Negative Min",c_min)
print ("\n tmp Positive Max",c_max)

#calculating the mean value
c_p_mean = calculatemean(c_positive)
b_p_mean = calculatemean(b_positive)
a_p_mean = calculatemean(a_positive)
c_n_mean = calculatemean(c_negative)
b_n_mean = calculatemean(b_negative)
a_n_mean = calculatemean(a_negative)
print ("\ntmp Negative Mean",c_n_mean)
print ("\n tmp Positive Mean",c_p_mean)

#calculating the sigma value
c_sigma_Negative = np.std(c_negative)
c_sigma_Positive = np.std(c_positive)
b_sigma_Negative = np.std(b_negative)
b_sigma_Positive = np.std(b_positive)
a_sigma_Negative = np.std(a_negative)
a_sigma_Positive = np.std(a_positive)

#plot the gaussian function with histograms
plotgaussianfunction(c_p_mean, c_sigma_Positive)
plotgaussianfunction(c_n_mean, c_sigma_Negative)
plotgaussianfunction(b_p_mean, b_sigma_Positive)
plotgaussianfunction(b_n_mean, b_sigma_Negative)
plotgaussianfunction(a_p_mean, a_sigma_Positive)
plotgaussianfunction(a_n_mean, a_sigma_Negative)

#normalization
c_p_s = np.random.normal(c_p_mean, c_sigma_Positive, 1000)
c_n_s = np.random.normal(c_n_mean, c_sigma_Negative, 1000)
b_p_s = np.random.normal(b_p_mean, b_sigma_Positive, 1000)
b_n_s = np.random.normal(b_n_mean, b_sigma_Negative, 1000)
a_p_s = np.random.normal(a_p_mean, a_sigma_Positive, 1000)
a_n_s = np.random.normal(a_n_mean, a_sigma_Negative, 1000)

#histograms minus the outliers
c_p_results = plotboundedCI(c_p_s, c_p_mean, c_sigma_Positive, c_positive)
c_n_results = plotboundedCI(c_n_s, c_n_mean, c_sigma_Negative, c_negative)
b_p_results = plotboundedCI(b_p_s, b_p_mean, b_sigma_Positive, b_positive)
b_n_results = plotboundedCI(b_n_s, b_n_mean, b_sigma_Negative, b_negative)
a_p_results = plotboundedCI(a_p_s, a_p_mean, a_sigma_Positive, a_positive)
a_n_results = plotboundedCI(a_n_s, a_n_mean, a_sigma_Negative, a_negative)



#next iteration: create all plots, change the number of cycles
#(j, count, print_df and mkdf are not defined in this excerpt)
for i in df:
    if 'C' in i:
        min_nor = -40
        max_nor = 150
        #Applying normalization for C between [-40,+150]
        new_value3 = normalize(df['C'].iloc[j:j+480], c_n_mean, c_p_mean, -40, 150)
        n_cbar_kws = {"ticks":[-40,150,-20,0,25,50,75,100,125]}
        df3 = print_df(mkdf(new_value3))
        df3.to_csv(f'{i}/norm{i}{count}.csv', header=None, index=None)  
    else:
        #Applying normalization for A, B between [-1,+1]
        new_value1 = normalize(df['A'].iloc[j:j+480], a_n_mean, a_p_mean, -1, 1)
        new_value2 = normalize(df['B'].iloc[j:j+480], b_n_mean, b_p_mean, -1, 1)
        n_cbar_kws = {"ticks":[-1.0,-0.75,-0.50,-0.25,0.00,0.25,0.50,0.75,1.0]}
        df1 = print_df(mkdf(new_value1))
        df2 = print_df(mkdf(new_value2))
        df1.to_csv(f'{i}/norm{i}{count}.csv', header=None, index=None) 
        df2.to_csv(f'{i}/norm{i}{count}.csv', header=None, index=None)     

'''
for i in df:
    if i=='C':
        #Applying normalization for C between [-40,+150]
        data['C']  = normalize(df[i].values, c_n_mean, c_p_mean, -40, 150)
    elif i=='A':
        #Applying normalization for A , B between [-1,+1]
        data['A'] = normalize(df[i].values, a_n_mean, a_p_mean, -1, 1)
    else:
        data['B'] = normalize(df[i].values, b_n_mean, b_p_mean, -1, 1)
'''


norm_data = pd.DataFrame(data, index = id_set[:,0])
print(norm_data)
norm_data.to_csv('norm.csv')

The problems are:

Problem I: I got a RuntimeWarning, which I have already suppressed, but I still get the error(s) below and have no clue how to solve them. They include ValueError: scale < 0, probably caused by separating the negative and positive numbers.

Problem II: After normalization, which can run for more than about 10 cycles (each cycle has 480 values), I realized that even after applying the Gaussian to both the negative and positive numbers and then using min-max normalization into [a, b], a number of values are still out of range. You can see this in the picture below for column 'C', where values fall outside [-40, +150], which is strange and not what I expected!
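To illustrate the out-of-range effect in isolation, here is a minimal sketch with hypothetical numbers (not the real column 'C'). Because the formula is linear, any value below the negative mean or above the positive mean maps outside [min_norm, max_norm]. The helper sign_means and the final np.clip step are assumptions added for illustration; clipping is only one possible way to force the result into the interval.

```python
import numpy as np

def normalize(value, min_value, max_value, min_norm, max_norm):
    return (max_norm - min_norm) * ((value - min_value) / (max_value - min_value)) + min_norm

def sign_means(values):
    # Hypothetical helper: mean of the negative and of the positive entries
    values = np.asarray(values, dtype=float)
    return values[values < 0].mean(), values[values > 0].mean()

# Hypothetical stand-in for column 'C'
c = np.array([-30.0, -10.0, 5.0, 15.0, 40.0])
c_n_mean, c_p_mean = sign_means(c)          # -20.0 and 20.0

# Means used as bounds: -30 and 40 lie beyond them, so they leave [-40, 150]
scaled = normalize(c, c_n_mean, c_p_mean, -40, 150)
print(scaled)

# One option (an assumption, not part of the original script): clip afterwards
clipped = np.clip(scaled, -40, 150)
print(clipped)
```

Clipping collapses every out-of-range value onto the nearest boundary, so whether it is acceptable depends on how those values should be treated.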

I hope someone has a good idea for solving the errors, or a better way to apply normalization using a Gaussian distribution function. Thanks for your attention.

Note 1: I have some missing data (nan or inf) in my list of values, which I have already replaced with a near-zero placeholder (exactly 0.012345). However, when there are no missing values in my lists of parameters, the code works!

Note 2: I also got ValueError: max must be larger than min in range parameter. for count, bins, ignored = plt.hist(s,30,density=True) and plotgaussianfunction(t_p_mean, t_sigma_Positive). I think it is actually related to the condition abs(sigma - np.std(s,ddof=1)) < 0.01, since I had a similar error, ValueError: scale < 0, for s = np.random.normal(mu, np.abs(sigma) ,1000), which I have already asked about here
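One way to rule out unusable inputs as the source of these errors is to validate the mean and sigma before sampling. The sketch below is an assumption-laden illustration, not a diagnosis: safe_normal_samples is a hypothetical helper that drops nan/inf entries and refuses to sample when sigma is missing, non-finite, or not positive.

```python
import numpy as np

def safe_normal_samples(values, size=1000, seed=0):
    # Hypothetical helper: return Gaussian samples, or None if the input is unusable
    arr = np.asarray(values, dtype=float).ravel()
    arr = arr[np.isfinite(arr)]              # drop nan and +/-inf entries
    if arr.size < 2:                         # too few values to estimate sigma
        return None
    mu = arr.mean()
    sigma = arr.std(ddof=1)                  # sample standard deviation
    if not np.isfinite(sigma) or sigma <= 0: # constant data gives sigma == 0
        return None
    rng = np.random.default_rng(seed)
    return rng.normal(mu, sigma, size)

print(safe_normal_samples([]))               # None: nothing to fit
print(safe_normal_samples([2.0, 2.0, 2.0]))  # None: sigma is zero
samples = safe_normal_samples([1.0, 2.0, 3.0, np.inf])
print(samples.shape)
```

Returning None instead of raising lets the caller skip the histogram for that list; raising an explicit error would be an equally reasonable design.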

I provided a sample of the dataset for 3 cycles: dataset

I also provided another dataset for 11 cycles: dataset

Source: https://stackoverflow.com/questions/54330676/normalization-by-using-2-times-gaussian-function-on-negative-and-positive-number

Tags

python

dataframe

normalization

gaussian