imputation | 易学教程

Oversampling: SMOTE for binary and categorical data in Python


Question: I would like to apply SMOTE to an unbalanced dataset that contains binary, categorical, and continuous data. Is there a way to apply SMOTE to binary and categorical data? Answer 1: As of January 2018, this had not been implemented in Python. Following is a reference from the team. In fact, they are open to proposals if someone wants to implement it. For those with an academic interest in this ongoing issue, the paper from Chawla & Bowyer addresses this SMOTE-Non Continuous sampling problem in section

How to replace missing values with group mode in Pandas?


I followed the method in this post to replace missing values with the group mode, but encountered "IndexError: index out of bounds". df['SIC'] = df.groupby('CIK').SIC.apply(lambda x: x.fillna(x.mode()[0])) I guess this is probably because some groups have all missing values and therefore have no mode. Is there a way to get around this? Thank you! Mode is quite difficult, given that there really isn't any agreed-upon way to deal with ties. Plus, it's typically very slow. Here's one way that will be "fast". We'll define a function that calculates the mode for each group, then we can fill the missing
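A minimal sketch of one way around the empty-mode case, assuming pandas and a small made-up DataFrame (the values below and the helper name group_mode are hypothetical, not from the original post). The helper returns NaN when a group has no non-missing values, so indexing into an empty result never happens; ties are broken by taking the smallest mode, which is only one of several possible conventions.

```python
import numpy as np
import pandas as pd


def group_mode(values: pd.Series):
    """Return the most frequent non-missing value, or NaN if there is none."""
    modes = values.mode()  # NaN is ignored by default; ties are returned sorted
    return modes.iloc[0] if not modes.empty else np.nan


# Hypothetical data: every SIC value for CIK 3 is missing.
df = pd.DataFrame({
    "CIK": [1, 1, 1, 2, 2, 3, 3],
    "SIC": [10.0, 10.0, np.nan, 20.0, np.nan, np.nan, np.nan],
})

# One mode per group, then map it back onto the rows and fill the gaps.
mode_per_group = df.groupby("CIK")["SIC"].agg(group_mode)
df["SIC"] = df["SIC"].fillna(df["CIK"].map(mode_per_group))
```

Groups with no observed values stay missing after this step; whether to fall back to an overall mode for them is a separate decision.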

How to extract a complete dataset from the Amelia package


In the mice package, you can extract a complete dataset with the complete() command as follows: install.packages("mice") library("mice") imp1 = mice(nhanes, 10) fill1 = complete(imp1, 1) fill2 = complete(imp1, 2) fillall = complete(imp1, "long") But can someone tell me how to extract a complete dataset in the Amelia package? install.packages("Amelia") library("Amelia") imp2 = amelia(freetrade, m = 5, ts = "year", cs = "country") The str() function is always helpful here. You'll see that the complete datasets are stored in the imputations element of the object returned by amelia(): > str(imp2, 1) List of 12 $

Replacing NA's in each column of matrix with the median of that column


Question: I am trying to replace the NAs in each column of a matrix with the median of that column. However, when I try to use lapply or sapply I get an error; the code works when I use a for-loop and when I change one column at a time. What am I doing wrong? Example: set.seed(1928) mat <- matrix(rnorm(100*110), ncol = 110) mat[sample(1:length(mat), 700, replace = FALSE)] <- NA mat1 <- mat2 <- mat mat1 <- lapply(mat1, function(n) { mat1[is.na(mat1[,n]),n] <- median(mat1[,n], na.rm = TRUE) } ) for (n
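For comparison, here is a rough sketch of the same column-median fill in Python with NumPy rather than R. The matrix shape and the number of missing cells mirror the question, but the random values themselves are different, and the variable names are only illustrative. The idea is to compute every column median once and let broadcasting place each median into the missing cells of its own column, so no explicit loop is needed.

```python
import numpy as np

rng = np.random.default_rng(1928)

# 100 x 110 matrix of normal draws with 700 cells set to NaN at random.
mat = rng.normal(size=(100, 110))
missing_idx = rng.choice(mat.size, size=700, replace=False)
mat.flat[missing_idx] = np.nan

# Median of each column, ignoring NaN (shape: 110).
col_medians = np.nanmedian(mat, axis=0)

# Where a cell is NaN take its column's median, otherwise keep the value.
filled = np.where(np.isnan(mat), col_medians, mat)
```

The original matrix is left untouched, which makes it easy to check afterwards that only the missing cells changed.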

Can MICE pool complete GLM output binary logistic regression?


I am running a logistic regression with a binary outcome variable on data that has been multiply imputed using MICE. It seems straightforward to pool the coefficients of the glm model: imp=mice(nhanes2, print=F) imp$meth fit0=with(data=imp, glm(hyp~age, family = binomial)) fit1=with(data=imp, glm(hyp~age+chl, family = binomial)) summary(pool(fit1)) However, I can't figure out a way to pool other output generated by the glm. For instance, the glm function produces AIC, Null deviance and Residual deviance that can be used for model testing. pool(summary(fit1)) ## summary of imputation 1 : Call:

Predicting missing values with scikit-learn's Imputer module


I am writing a very basic program to predict missing values in a dataset using scikit-learn's Imputer class. I have made a NumPy array, created an Imputer object with strategy='mean', and performed fit_transform() on the NumPy array. When I print the array after performing fit_transform(), the NaNs remain, and I don't get any prediction. What am I doing wrong here? How do I go about predicting the missing values? import numpy as np from sklearn.preprocessing import Imputer X = np.array([[23.56],[53.45],['NaN'],[44.44],[77.78],['NaN'],[234.44],[11.33],[79.87]]) print X imp = Imputer(missing
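A minimal sketch of mean imputation on the same numbers, written for current scikit-learn, where sklearn.preprocessing.Imputer has been removed and sklearn.impute.SimpleImputer takes its place. It assumes two things that the excerpt does not confirm as the cause of the problem: that the missing entries should be the float np.nan rather than the string 'NaN' (mixing strings and floats makes NumPy store the whole array as strings), and that the value to inspect is the array returned by fit_transform(), since the input array is not modified in place.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Missing entries are real floating-point NaN values, not the string 'NaN'.
X = np.array([[23.56], [53.45], [np.nan], [44.44], [77.78],
              [np.nan], [234.44], [11.33], [79.87]])

imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform() returns a new array; X itself still contains the NaNs.
X_filled = imp.fit_transform(X)
print(X_filled)
```

Each missing entry is replaced by the mean of the observed values in its column, which is a simple fill rather than a model-based prediction.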

mean-before-after imputation in R


Question: I'm new to R. How can I impute a missing value using the mean of the values before and after the missing data point? For example, using the mean of the values above and below each NA as the imputed value: - mean for row number 3 is 38.5 - mean for row number 7 is 32.5 age 52.0 27.0 NA 23.0 39.0 32.0 NA 33.0 43.0 Thank you. Answer 1: Here is a solution using na.locf from the zoo package, which replaces each NA with the most recent non-NA value prior or posterior to it. 0.5*(na.locf(x, fromLast=TRUE) + na.locf(x)) [1]
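The same forward-fill plus backward-fill averaging can be sketched in Python with pandas, where ffill() and bfill() play the role of na.locf with and without fromLast. The series below is a hypothetical example chosen so the result is easy to verify by hand; it is not the data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical values with two isolated gaps.
age = pd.Series([10.0, np.nan, 20.0, 30.0, np.nan, 50.0])

# ffill() carries the previous observation forward, bfill() carries the next
# one backward; their average is the mean of the two neighbours of each gap.
neighbour_mean = 0.5 * (age.ffill() + age.bfill())

# Only the missing positions are replaced; observed values are kept as they are.
imputed = age.fillna(neighbour_mean)
print(imputed.tolist())  # [10.0, 15.0, 20.0, 30.0, 40.0, 50.0]
```

A gap at the very start or end of the series has only one neighbour, so it stays missing under this approach.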

Multinomial regression with imputed data


Question: I need to impute missing data and then conduct multinomial regression with the generated datasets. I have tried using mice for the imputation and then the multinom function from nnet for the multinomial regression, but this gives me unreadable output. Here is an example using the nhanes2 dataset available with the mice package: library(mice) library(nnet) test <- mice(nhanes2, meth=c('sample','pmm','logreg','norm')) #age is categorical, bmi is continuous m <- with(test, multinom(age ~ bmi, model = T)

Impute missing data with mean by group


I have a categorical variable with three levels (A, B, and C). I also have a continuous variable with some missing values in it. I would like to replace the NA values with the mean of their group. That is, missing observations from group A have to be replaced with the mean of group A. I know I can just calculate each group's mean and replace missing values, but I'm sure there's another way to do so more efficiently with loops. A <- subset(data, group == "A") mean(A$variable, na.rm = TRUE) A$variable[which(is.na(A$variable))] <- mean(A$variable, na.rm = TRUE) Now, I understand I could do the
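As a point of comparison, here is a short sketch of group-mean imputation in Python with pandas instead of R. The column names group and variable follow the question, while the values are made up for illustration. groupby(...).transform("mean") returns each row's group mean aligned with the original rows, so all groups are handled in one step without subsetting or looping.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in group A and one in group B.
data = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "C", "C"],
    "variable": [1.0, 3.0, np.nan, 10.0, np.nan, 5.0, 7.0],
})

# Mean of each group (NaN ignored), repeated for every row of that group.
group_means = data.groupby("group")["variable"].transform("mean")

# Fill only the missing entries with their own group's mean.
data["variable"] = data["variable"].fillna(group_means)
print(data["variable"].tolist())  # [1.0, 3.0, 2.0, 10.0, 10.0, 5.0, 7.0]
```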
