imputation

Oversampling: SMOTE for binary and categorical data in Python

允我心安 submitted on 2019-12-07 09:00:56
Question: I would like to apply SMOTE to an unbalanced dataset that contains binary, categorical and continuous data. Is there a way to apply SMOTE to binary and categorical data? Answer 1: As of January 2018, this had not been implemented in Python. Following is a reference from the team. In fact, they are open to proposals if someone wants to implement it. For those with an academic interest in this ongoing issue, the paper from Chawla & Bowyer addresses this SMOTE-Non Continuous sampling problem in section
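To make the mixed-data idea concrete, here is a rough NumPy sketch of how one synthetic minority sample could be generated: continuous features are interpolated between a point and one of its neighbours, and each categorical feature takes the most frequent value among the neighbours. This is a simplified illustration, not the algorithm from the paper and not any library's API. The function name `smote_nc_sample`, the toy arrays, and the choice to measure distance on the continuous columns only are all assumptions made for this example. The imbalanced-learn project later added a `SMOTENC` class for this use case, which is the better choice in practice.

```python
import numpy as np

def smote_nc_sample(X_num, X_cat, k=3, rng=None):
    """Return one synthetic sample (continuous part, categorical part).

    Simplified sketch: neighbours are found using the continuous
    columns only, which is an assumption of this example.
    """
    rng = np.random.default_rng(rng)
    i = rng.integers(len(X_num))                 # pick a random minority sample
    dist = np.linalg.norm(X_num - X_num[i], axis=1)
    dist[i] = np.inf                             # exclude the sample itself
    neighbours = np.argsort(dist)[:k]            # indices of the k nearest samples

    # Continuous features: random point on the segment to one neighbour.
    j = rng.choice(neighbours)
    gap = rng.random()
    new_num = X_num[i] + gap * (X_num[j] - X_num[i])

    # Categorical features: most frequent value among the k neighbours.
    new_cat = []
    for col in X_cat[neighbours].T:
        values, counts = np.unique(col, return_counts=True)
        new_cat.append(values[np.argmax(counts)])
    return new_num, np.array(new_cat)

# Hypothetical minority-class data: two continuous and two categorical columns.
X_num = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [4.0, 15.0]])
X_cat = np.array([["a", "x"], ["a", "y"], ["b", "x"], ["a", "x"]])

new_num, new_cat = smote_nc_sample(X_num, X_cat, k=3, rng=0)
print(new_num, new_cat)
```

Because the categorical values are chosen by vote rather than interpolated, the synthetic row never contains a category that is absent from the original data.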

How to replace missing values with group mode in Pandas?

我们两清 submitted on 2019-12-06 16:49:27
I followed the method in this post to replace missing values with the group mode, but encountered "IndexError: index out of bounds". df['SIC'] = df.groupby('CIK').SIC.apply(lambda x: x.fillna(x.mode()[0])) I guess this is probably because some groups have all missing values and therefore have no mode. Is there a way to get around this? Thank you! mode is quite difficult, given that there really isn't any agreed-upon way to deal with ties. Plus it's typically very slow. Here's one way that will be "fast". We'll define a function that calculates the mode for each group, then we can fill the missing
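One possible way to sidestep the error is sketched below. It is not necessarily the approach the truncated answer goes on to describe. The helper name `safe_mode` and the toy data are assumptions for illustration. The helper returns NaN when a group has no non-missing values, so indexing into an empty mode result never happens. Ties are resolved by taking the first value `mode()` returns.

```python
import numpy as np
import pandas as pd

def safe_mode(s):
    """Return the first mode of s, or NaN if s has no non-missing values."""
    m = s.mode()  # mode() ignores NaN by default
    return m.iloc[0] if not m.empty else np.nan

# Hypothetical data: group 2 is entirely missing.
df = pd.DataFrame({
    "CIK": [1, 1, 1, 2, 2, 3, 3],
    "SIC": [10, 10, np.nan, np.nan, np.nan, 20, np.nan],
})

# One mode per group, then look it up for every row and fill the gaps.
group_modes = df.groupby("CIK")["SIC"].agg(safe_mode)
df["SIC"] = df["SIC"].fillna(df["CIK"].map(group_modes))
print(df)
```

Rows in a group with no observed values stay missing, which makes those groups easy to find afterwards.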

How to extract complete dataset from Amelia package

两盒软妹~` submitted on 2019-12-06 05:10:00
In the mice package, you can extract a complete dataset with the complete() command as follows: install.packages("mice") library("mice") imp1 = mice(nhanes, 10) fill1 = complete(imp1, 1) fill2 = complete(imp1, 2) fillall = complete(imp1, "long") But can someone tell me how to extract a complete dataset in the Amelia package? install.packages("Amelia") library("Amelia") imp2 = amelia(freetrade, m = 5, ts = "year", cs = "country") The str() function is always helpful here. You'll see that the complete datasets are stored in the imputations element of the object returned by amelia() : > str(imp2, 1) List of 12 $

Replacing NA's in each column of matrix with the median of that column

只谈情不闲聊 submitted on 2019-12-04 03:56:52
Question: I am trying to replace the NA's in each column of a matrix with the median of that column. However, when I try to use lapply or sapply I get an error; the code works when I use a for-loop and when I change one column at a time. What am I doing wrong? Example: set.seed(1928) mat <- matrix(rnorm(100*110), ncol = 110) mat[sample(1:length(mat), 700, replace = FALSE)] <- NA mat1 <- mat2 <- mat mat1 <- lapply(mat1, function(n) { mat1[is.na(mat1[,n]),n] <- median(mat1[,n], na.rm = TRUE) } ) for (n
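For comparison, the same column-wise median replacement can be written without any loop in Python. The sketch below uses NumPy and mirrors the shape of the example matrix. The variable names and the random generator are assumptions for illustration, and the random values will not match those from R's set.seed.

```python
import numpy as np

rng = np.random.default_rng(1928)

# 100 x 110 matrix with 700 randomly placed missing values.
mat = rng.normal(size=(100, 110))
missing_idx = rng.choice(mat.size, size=700, replace=False)
mat.flat[missing_idx] = np.nan

# Median of each column, ignoring NaN; shape (110,).
col_median = np.nanmedian(mat, axis=0)

# Broadcasting lines each median up with its column.
filled = np.where(np.isnan(mat), col_median, mat)

print(np.isnan(mat).sum(), np.isnan(filled).sum())
```

`np.where` returns a new array, so `mat` keeps its missing values and `filled` holds the imputed copy.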

Can MICE pool complete GLM output binary logistic regression?

喜欢而已 submitted on 2019-12-03 10:13:43
I am running a logistic regression with a binary outcome variable on data that has been multiply imputed using MICE. It seems straightforward to pool the coefficients of the glm model: imp=mice(nhanes2, print=F) imp$meth fit0=with(data=imp, glm(hyp~age, family = binomial)) fit1=with(data=imp, glm(hyp~age+chl, family = binomial)) summary(pool(fit1)) However, I can't figure out a way to pool other output generated by the glm. For instance, the glm function produces AIC, Null deviance and Residual deviance that can be used for model testing. pool(summary(fit1)) ## summary of imputation 1 : Call:
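As background on what pooling a coefficient involves, the sketch below applies Rubin's rules to one coefficient estimated on several imputed datasets: the pooled estimate is the mean of the per-imputation estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `pool_rubin` and the numbers are hypothetical. This covers coefficients only and does not answer how AIC or deviance should be combined.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputations using Rubin's rules.

    estimates: per-imputation point estimates
    variances: per-imputation squared standard errors
    Returns (pooled estimate, total variance).
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()              # pooled point estimate
    within = variances.mean()             # average within-imputation variance
    between = estimates.var(ddof=1)       # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical coefficient and squared standard error from m = 3 imputed fits.
q_bar, total_var = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
print(q_bar, total_var, np.sqrt(total_var))
```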

Predicting missing values with scikit-learn's Imputer module

我是研究僧i submitted on 2019-12-03 05:05:44
I am writing a very basic program to predict missing values in a dataset using scikit-learn's Imputer class. I have made a NumPy array, created an Imputer object with strategy='mean' and performed fit_transform() on the NumPy array. When I print the array after performing fit_transform(), the 'NaN's remain, and I don't get any prediction. What am I doing wrong here? How do I go about predicting the missing values? import numpy as np from sklearn.preprocessing import Imputer X = np.array([[23.56],[53.45],['NaN'],[44.44],[77.78],['NaN'],[234.44],[11.33],[79.87]]) print X imp = Imputer(missing
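Since the snippet is cut off, the cause cannot be confirmed from it, but two things are worth checking. Mixing the string 'NaN' with floats gives a string array rather than a numeric one, and a mean fill produces a new array instead of changing the input in place. The sketch below shows what a mean strategy computes, in plain NumPy with np.nan as the missing marker. It is an illustration, not the scikit-learn implementation. In later scikit-learn releases the Imputer class was replaced by SimpleImputer in sklearn.impute.

```python
import numpy as np

# Use np.nan, not the string 'NaN', so the array stays numeric.
X = np.array([[23.56], [53.45], [np.nan], [44.44], [77.78],
              [np.nan], [234.44], [11.33], [79.87]])

# Column means computed over the observed values only.
col_mean = np.nanmean(X, axis=0)

# Build a new array with the gaps filled; X itself is left unchanged.
X_filled = np.where(np.isnan(X), col_mean, X)

print(X_filled)
```

Printing `X` after this still shows the missing values; the imputed data lives in `X_filled`.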

mean-before-after imputation in R

孤街醉人 submitted on 2019-12-02 02:59:56
Question: I'm new to R. My question is how to impute a missing value using the mean of the values before and after the missing data point. Example: use the mean of the values directly above and below each NA as the imputed value. -mean for row number 3 is 38.5 -mean for row number 7 is 32.5 age 52.0 27.0 NA 23.0 39.0 32.0 NA 33.0 43.0 Thank you. Answer 1: Here is a solution using na.locf from the zoo package, which replaces each NA with the most recent non-NA value prior or posterior to it. 0.5*(na.locf(x, fromLast=TRUE) + na.locf(x)) [1]
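The same idea carries over to pandas: average a forward fill and a backward fill, and use the result only where the original value is missing. This is a sketch of the equivalent technique, not a translation of the zoo internals, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

age = pd.Series([52.0, 27.0, np.nan, 23.0, 39.0, 32.0, np.nan, 33.0, 43.0])

# ffill carries the previous observed value forward,
# bfill carries the next observed value backward.
neighbour_mean = 0.5 * (age.ffill() + age.bfill())

# Only the missing positions take the averaged value.
filled = age.fillna(neighbour_mean)
print(filled)
```

With the values as listed, the second gap is filled with 32.5, matching the question. A missing value at the very start or end of the series has only one neighbour and therefore stays missing.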

multinomial regression with imputed data

杀马特。学长 韩版系。学妹 submitted on 2019-12-01 11:30:35
Question: I need to impute missing data and then conduct multinomial regression with the generated datasets. I have tried using mice for the imputation and then the multinom function from nnet for the multinomial regression. But this gives me unreadable output. Here is an example using the nhanes2 dataset available with the mice package: library(mice) library(nnet) test <- mice(nhanes2, meth=c('sample','pmm','logreg','norm')) #age is categorical, bmi is continuous m <- with(test, multinom(age ~ bmi, model = T)

Impute missing data with mean by group

怎甘沉沦 submitted on 2019-12-01 08:41:14
I have a categorical variable with three levels ( A , B , and C ). I also have a continuous variable with some missing values in it. I would like to replace the NA values with the mean of their group. That is, missing observations from group A have to be replaced with the mean of group A . I know I can just calculate each group's mean and replace missing values, but I'm sure there's another way to do so more efficiently with loops. A <- subset(data, group == "A") mean(A$variable, na.rm = TRUE) A$variable[which(is.na(A$variable))] <- mean(A$variable, na.rm = TRUE) Now, I understand I could do the
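For readers working in Python, the per-group replacement can be done in one pass with pandas and no explicit loop or subsetting. The sketch below keeps the column names `group` and `variable` from the question; the data values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same column names as in the question.
data = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "C", "C"],
    "variable": [1.0, 3.0, np.nan, 10.0, np.nan, 5.0, 7.0],
})

# transform("mean") returns each row's group mean, aligned to the
# original index; NaN values are skipped when the mean is computed.
group_mean = data.groupby("group")["variable"].transform("mean")

data["variable"] = data["variable"].fillna(group_mean)
print(data)
```

Because `transform` returns a result with the same index as the input, the fill lines up row by row without any merge step.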

Impute missing data with mean by group

喜欢而已 submitted on 2019-12-01 05:10:58
Question: I have a categorical variable with three levels ( A , B , and C ). I also have a continuous variable with some missing values in it. I would like to replace the NA values with the mean of their group. That is, missing observations from group A have to be replaced with the mean of group A . I know I can just calculate each group's mean and replace missing values, but I'm sure there's another way to do so more efficiently with loops. A <- subset(data, group == "A") mean(A$variable, na.rm = TRUE) A