imputation | 易学教程

Generate larger synthetic dataset based on a smaller dataset in Python

Question: I have a dataset with 21000 rows (data samples) and 102 columns (features). I would like to generate a larger synthetic dataset from it, say with 100000 rows, so that I can use it for machine learning. I've been referring to the answer by @Prashant on this post https://stats.stackexchange.com/questions/215938/generate-synthetic-data-to-match-sample-data, but I cannot get it to work for generating a larger synthetic dataset from my data.
import numpy as
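One simple way to enlarge a numeric dataset is to fit a multivariate normal distribution to it and draw new rows from the fit. The sketch below is not necessarily the method in the linked answer. It assumes that all 102 features are numeric and roughly jointly Gaussian, and it uses a random stand-in array (the hypothetical name `real`) in place of the actual data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real data: 21000 rows, 102 numeric features.
real = rng.normal(size=(21000, 102))

# Estimate the mean vector and covariance matrix from the original rows.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Draw 100000 synthetic rows from the fitted multivariate normal.
synthetic = rng.multivariate_normal(mean, cov, size=100_000)

print(synthetic.shape)
```

This reproduces only the means and pairwise covariances of the original data. Categorical columns or strongly non-Gaussian features would need a different generative model.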

Missing values in Time Series in Python

Question: I have a large time series dataframe with some missing values in two columns ('Humidity' and 'Pressure'). I would like to impute these missing values in a sensible way, for example using the value of the nearest neighbor or the average of the previous and following timestamps. Is there an easy way to do it? I tried fancyimpute, but the dataset contains around 180000 examples and it gives a memory error.

Answer 1: Consider interpolate (documentation). This example shows
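A minimal sketch of the interpolate suggestion, assuming the dataframe has a DatetimeIndex and the two columns are numeric. The sample readings are hypothetical. With `method="time"`, pandas weights by the actual time gap, so a single missing point between evenly spaced timestamps becomes the average of its two neighbors.

```python
import numpy as np
import pandas as pd

# Hypothetical readings every 10 minutes, with gaps in both columns.
idx = pd.date_range("2020-01-01", periods=6, freq="10min")
df = pd.DataFrame(
    {
        "Humidity": [40.0, np.nan, 44.0, 46.0, np.nan, 50.0],
        "Pressure": [1010.0, 1012.0, np.nan, np.nan, 1018.0, 1020.0],
    },
    index=idx,
)

cols = ["Humidity", "Pressure"]

# Interpolate along the time axis; only the two affected columns are touched.
df[cols] = df[cols].interpolate(method="time")

print(df)
```

Interpolation works column by column and does not build a full distance matrix, so it should scale to around 180000 rows without the memory problems reported for fancyimpute.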

Imputer on some Dataframe columns in Python

Question: I am learning how to use Imputer in Python. This is my code:
df=pd.DataFrame([["XXL", 8, "black", "class 1", 22], ["L", np.nan, "gray", "class 2", 20], ["XL", 10, "blue", "class 2", 19], ["M", np.nan, "orange", "class 1", 17], ["M", 11, "green", "class 3", np.nan], ["M", 7, "red", "class 1", 22]])
df.columns=["size", "price", "color", "class", "boh"]
from sklearn.preprocessing import Imputer
imp=Imputer(missing_values="NaN", strategy="mean" )
imp.fit(df["price"])
df["price"]=imp.transform(df[
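For reference, `sklearn.preprocessing.Imputer` was removed in later scikit-learn releases in favor of `sklearn.impute.SimpleImputer`. The sketch below assumes a recent scikit-learn and shows one way to impute a single column. The key detail is selecting the column with double brackets, so the imputer receives a 2-D input rather than a 1-D Series.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame(
    [
        ["XXL", 8, "black", "class 1", 22],
        ["L", np.nan, "gray", "class 2", 20],
        ["XL", 10, "blue", "class 2", 19],
        ["M", np.nan, "orange", "class 1", 17],
        ["M", 11, "green", "class 3", np.nan],
        ["M", 7, "red", "class 1", 22],
    ],
    columns=["size", "price", "color", "class", "boh"],
)

imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# df[["price"]] is a one-column DataFrame (2-D), which scikit-learn expects.
# fit_transform returns an (n_rows, 1) array; ravel() flattens it for assignment.
df["price"] = imp.fit_transform(df[["price"]]).ravel()

print(df["price"].tolist())
```

Only `price` is imputed here. The `boh` column keeps its missing value unless it is passed to the imputer as well.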

Predicting missing values with scikit-learn's Imputer module

Question: I am writing a very basic program to predict missing values in a dataset using scikit-learn's Imputer class. I created a NumPy array, created an Imputer object with strategy='mean', and called fit_transform() on the array. When I print the array afterwards, the NaNs remain and I don't get any predictions. What am I doing wrong? How do I go about predicting the missing values?
import numpy as np
from sklearn.preprocessing import Imputer
X = np.array([[23
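One common cause of this symptom is that `fit_transform()` returns a new array instead of modifying its input, so printing the original array still shows the NaNs. Whether that is the cause here is an assumption, since the question's code is cut off. The sketch below uses a hypothetical array and `SimpleImputer`, which replaced `Imputer` in recent scikit-learn releases.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data; the array in the question is truncated.
X = np.array([[23.0, 56.0], [np.nan, 45.0], [53.0, np.nan]])

imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform returns a new array; X itself is left unchanged.
X_filled = imp.fit_transform(X)

print(np.isnan(X).sum())  # NaNs still present in the original
print(X_filled)           # NaNs replaced by each column's mean
```

Note that the `mean` strategy fills each gap with its column average. It does not predict values from the other features.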

Marginal effects with survey weights and multiple imputations

Question: I am working with survey data that use probability weights and multiple imputations. I would like to get marginal effects after estimating a logit model using the imputed data sets and the survey weights. I cannot figure out how to do this in R. Stata has the package mimrgns, which makes it pretty easy. There is also this article (pdf) and supplementary material (pdf) that give some direction, but I can't seem to apply them to my situation. In the following example, please assume I already

Conditional imputation with LOCF

Question: I have this example of longitudinal data. I need to impute the 0, 999 or -1 values according to what occurs before them.
ID = c(1,1,1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,6,6,6,6,6)
Oxy = c(0, 999, 1, 999, 999, 0, 0, 999, 999, 0, 0, -1, 0, 999, 1, 1, -1, 1, 999, -1, 0, -1, 1,0, 999, 0)
Y = c(2010,2011,2012,2013,2014,2011,2012,2013,2010,2011,2012,2010,2011, 2012,2010,2011,2012,2013,2014,2015,2016,2017, 2018,2019,2020, 2021)
Oxy2 = c(0, 999, 1, 1, 1, 0, 0, 999, 999, 0, 0, -1, 0, 999, 1, 1, 1, 1, 999, -1, 0
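The full rule is cut off in the excerpt. Judging only from the visible part of Oxy2, it looks as if every value after the first 1 within an ID is replaced by 1. The pandas sketch below implements that inferred rule as an assumption. The question itself is about R, and the rows are assumed to be already ordered by year within each ID.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4,
               5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6],
        "Oxy": [0, 999, 1, 999, 999, 0, 0, 999, 999, 0, 0, -1, 0, 999,
                1, 1, -1, 1, 999, -1, 0, -1, 1, 0, 999, 0],
    }
)

# Flag, within each ID, every row at or after the first occurrence of 1.
# (Inferred rule, not confirmed by the truncated question.)
seen_one = (
    df["Oxy"].eq(1).astype(int).groupby(df["ID"]).cummax().astype(bool)
)

# Keep the original value before the first 1, carry 1 forward afterwards.
df["Oxy2"] = df["Oxy"].where(~seen_one, 1)

print(df)
```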

Multinomial regression with imputed data

Question: I need to impute missing data and then conduct multinomial regression with the generated datasets. I have tried using mice for the imputation and then the multinom function from nnet for the multinomial regression, but this gives me unreadable output. Here is an example using the nhanes2 dataset available with the mice package:
library(mice)
library(nnet)
test <- mice(nhanes2, meth=c('sample','pmm','logreg','norm'))
#age is categorical, bmi is continuous
m <- with(test, multinom(age ~ bmi, model = T)

variable fillna() in each column

Question: For starters, here is some artificial data fitting my problem:
df = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)), columns = ["col_{}".format(x) for x in range(10)], index = range(0, vsize * 3, 3))
df_2 = pd.DataFrame(np.random.randint(0,100,size=(vsize, 10)), columns = ["col_{}".format(x) for x in range(10, 20, 1)], index = range(0, vsize * 2, 2))
df = df.merge(df_2, left_index = True, right_index = True, how = 'outer')
df_tar = pd.DataFrame({"tar_1": [np.random.randint(0, 2) for
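The exact fill rule the asker wants is cut off. As a general illustration of filling each column with its own value, the sketch below passes `fillna()` a mapping from column name to fill value. The small frame and the chosen fill values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps, similar to what an outer merge produces.
df = pd.DataFrame(
    {
        "col_0": [1.0, np.nan, 3.0],
        "col_10": [np.nan, 10.0, 30.0],
    }
)

# A dict maps each column name to its own constant fill value.
filled_const = df.fillna({"col_0": 0.0, "col_10": -1.0})

# A Series indexed by column name works the same way; here, each column's mean.
filled_mean = df.fillna(df.mean())

print(filled_const)
print(filled_mean)
```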

Error when trying to use imputed data for sem using “mi” package

Question: I am trying to fit a path model with my imputed data, but I can't figure out how to get my code to work. A regular regression like this works fine with the pool function:
analysis <- pool(outcome1 ~ variable1 + variable2, data = imputations, m = NULL)
But when I try to make it a path model, it gives me errors. For example, this is code I have tried:
analysis <- pool(outcome1 + outcome2 ~ variable1 + variable2, data = imputations, m = NULL)
Error in pool(outcome1 + outcome2 ~ variable1 + variable2, data =