imputation

Generate larger synthetic dataset based on a smaller dataset in Python

为君一笑 submitted on 2020-01-02 08:50:15
Question: I have a dataset with 21000 rows (data samples) and 102 columns (features). I would like to generate a larger synthetic dataset based on it, say with 100000 rows, so that I can use it for machine learning. I've been referring to the answer by @Prashant on this post https://stats.stackexchange.com/questions/215938/generate-synthetic-data-to-match-sample-data, but I am unable to get it to generate a larger synthetic dataset for my data. import numpy as
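One simple way to grow a numeric dataset can be sketched as follows. This is not the method from the linked answer; it assumes the features are roughly jointly Gaussian, fits a mean vector and covariance matrix to the small dataset, and samples new rows from that fit. The function name `sample_gaussian_synthetic` and the toy array `small` are hypothetical and only serve the illustration.

```python
import numpy as np


def sample_gaussian_synthetic(data, n_samples, seed=0):
    """Draw n_samples rows from a multivariate normal fitted to `data`.

    Assumption: the columns are numeric and roughly jointly Gaussian.
    """
    rng = np.random.default_rng(seed)
    mean = data.mean(axis=0)             # per-feature mean
    cov = np.cov(data, rowvar=False)     # feature-by-feature covariance
    return rng.multivariate_normal(mean, cov, size=n_samples)


# Toy stand-in for the real 21000 x 102 dataset (hypothetical data).
small = np.random.default_rng(42).normal(size=(200, 5))
synthetic = sample_gaussian_synthetic(small, n_samples=1000)
print(synthetic.shape)
```

A Gaussian fit preserves only means and linear correlations, so skewed, categorical, or multimodal features would need a different generator.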

Missing values in Time Series in python

南楼画角 submitted on 2020-01-01 04:56:06
Question: I have a large time series dataframe with some missing values in two columns ('Humidity' and 'Pressure'). I would like to impute these missing values in a clever way, for example using the value of the nearest neighbor or the average of the previous and following timestamps. Is there an easy way to do it? I have tried fancyimpute, but the dataset contains around 180000 examples and it gives a memory error. Answer 1: Consider interpolate (documentation). This example shows
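A minimal sketch of the interpolate approach, assuming the dataframe has a DatetimeIndex and that linear interpolation in time is acceptable. The column names come from the question; the values and dates are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in both columns.
index = pd.date_range("2020-01-01", periods=6, freq="D")
df = pd.DataFrame(
    {
        "Humidity": [40.0, np.nan, 44.0, np.nan, np.nan, 50.0],
        "Pressure": [1000.0, 1002.0, np.nan, 1006.0, np.nan, 1010.0],
    },
    index=index,
)

cols = ["Humidity", "Pressure"]
filled = df.copy()
# method="time" weights by the actual distance between timestamps; for an
# isolated gap on an evenly spaced index this equals the average of the
# previous and following values.
filled[cols] = df[cols].interpolate(method="time")
print(filled)
```

Interpolation works column by column without building a dense neighbor matrix, so it should scale to a few hundred thousand rows far more easily than a k-nearest-neighbor imputer.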

Imputer on some Dataframe columns in Python

倖福魔咒の submitted on 2020-01-01 04:43:07
Question: I am learning how to use Imputer in Python. This is my code: df=pd.DataFrame([["XXL", 8, "black", "class 1", 22], ["L", np.nan, "gray", "class 2", 20], ["XL", 10, "blue", "class 2", 19], ["M", np.nan, "orange", "class 1", 17], ["M", 11, "green", "class 3", np.nan], ["M", 7, "red", "class 1", 22]]) df.columns=["size", "price", "color", "class", "boh"] from sklearn.preprocessing import Imputer imp=Imputer(missing_values="NaN", strategy="mean" ) imp.fit(df["price"]) df["price"]=imp.transform(df[
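A sketch of how the same idea looks with current scikit-learn, where `sklearn.preprocessing.Imputer` has been removed (as of version 0.22) and `sklearn.impute.SimpleImputer` takes its place. Which columns to impute is an assumption here; the question's code only shows "price" before it is cut off.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame(
    [["XXL", 8, "black", "class 1", 22],
     ["L", np.nan, "gray", "class 2", 20],
     ["XL", 10, "blue", "class 2", 19],
     ["M", np.nan, "orange", "class 1", 17],
     ["M", 11, "green", "class 3", np.nan],
     ["M", 7, "red", "class 1", 22]],
    columns=["size", "price", "color", "class", "boh"],
)

# Assumption: only the numeric columns are imputed.
numeric_cols = ["price", "boh"]
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
# Double brackets keep the selection 2-D, which scikit-learn transformers expect.
df[numeric_cols] = imp.fit_transform(df[numeric_cols])
print(df)
```

Passing `df["price"]` (a 1-D Series) instead of `df[["price"]]` is a common source of shape errors with scikit-learn transformers.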

Predicting missing values with scikit-learn's Imputer module

那年仲夏 submitted on 2020-01-01 01:39:09
Question: I am writing a very basic program to predict missing values in a dataset using scikit-learn's Imputer class. I have made a NumPy array, created an Imputer object with strategy='mean' and performed fit_transform() on the NumPy array. When I print the array after performing fit_transform(), the NaNs remain, and I don't get any prediction. What am I doing wrong here? How do I go about predicting the missing values? import numpy as np from sklearn.preprocessing import Imputer X = np.array([[23
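To make clear what strategy='mean' computes, here is a plain NumPy sketch of column-mean imputation. The helper `impute_column_means` and the small array are hypothetical; the array from the question is truncated and is not reproduced here.

```python
import numpy as np


def impute_column_means(X):
    """Return a copy of X with each NaN replaced by its column's mean."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)          # means that ignore NaNs
    return np.where(np.isnan(X), col_means, X)


X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan]])
X_filled = impute_column_means(X)
print(X_filled)
```

Like this helper, scikit-learn's `fit_transform()` returns a new array by default rather than modifying its input, so printing the original `X` afterwards will still show the NaNs; this is one possible cause of the behaviour described, not a confirmed diagnosis.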

Marginal effects with survey weights and multiple imputations

孤人 submitted on 2019-12-24 12:15:11
Question: I am working with survey data that use probability weights and multiple imputations. I would like to get marginal effects after estimating a logit model using the imputed data sets and the survey weights. I cannot figure out how to do this in R. Stata has the package mimrgns, which makes it pretty easy. There is also this article (pdf) and supplementary material (pdf) that give some direction, but I can't seem to apply them to my situation. In the following example, please assume I already

Conditional imputation with LOCF

混江龙づ霸主 submitted on 2019-12-24 06:54:53
Question: I have this example of longitudinal data. I need to impute 0, 999 or -1 values according to what occurs before. ID = c(1,1,1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,6,6,6,6,6) Oxy = c(0, 999, 1, 999, 999, 0, 0, 999, 999, 0, 0, -1, 0, 999, 1, 1, -1, 1, 999, -1, 0, -1, 1,0, 999, 0) Y = c(2010,2011,2012,2013,2014,2011,2012,2013,2010,2011,2012,2010,2011, 2012,2010,2011,2012,2013,2014,2015,2016,2017, 2018,2019,2020, 2021) Oxy2 = c(0, 999, 1, 1, 1, 0, 0, 999, 999, 0, 0, -1, 0, 999, 1, 1, 1, 1, 999, -1, 0
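The question is posed in R; the sketch below uses pandas instead and rests on a guessed rule, since the expected output Oxy2 is cut off. The assumed rule: within each ID, a 999 or -1 is replaced by 1 only when the most recent preceding 0/1 value is 1, and is otherwise left alone. This rule reproduces the visible part of Oxy2 but may not match the asker's full intent.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4,
           5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6],
    "Oxy": [0, 999, 1, 999, 999, 0, 0, 999, 999, 0, 0, -1, 0, 999,
            1, 1, -1, 1, 999, -1, 0, -1, 1, 0, 999, 0],
})

# Last observed "real" value (0 or 1) so far, carried forward within each ID.
is_real = df["Oxy"].isin([0, 1])
last_real = df["Oxy"].where(is_real).groupby(df["ID"]).ffill()

# Assumed rule: placeholders (999, -1) become 1 only when the last real value is 1.
is_placeholder = df["Oxy"].isin([999, -1])
df["Oxy_filled"] = df["Oxy"].mask(is_placeholder & (last_real == 1), 1)
print(df)
```

Grouping by ID before the forward fill keeps one subject's values from leaking into the next subject's first rows.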

Conditional imputation with LOCF

那年仲夏 submitted on 2019-12-24 06:53:45
Question: I have this example of longitudinal data. I need to impute 0, 999 or -1 values according to what occurs before. ID = c(1,1,1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,6,6,6,6,6) Oxy = c(0, 999, 1, 999, 999, 0, 0, 999, 999, 0, 0, -1, 0, 999, 1, 1, -1, 1, 999, -1, 0, -1, 1,0, 999, 0) Y = c(2010,2011,2012,2013,2014,2011,2012,2013,2010,2011,2012,2010,2011, 2012,2010,2011,2012,2013,2014,2015,2016,2017, 2018,2019,2020, 2021) Oxy2 = c(0, 999, 1, 1, 1, 0, 0, 999, 999, 0, 0, -1, 0, 999, 1, 1, 1, 1, 999, -1, 0

Multinomial regression with imputed data

耗尽温柔 submitted on 2019-12-24 06:21:32
Question: I need to impute missing data and then conduct multinomial regression with the generated datasets. I have tried using mice for the imputation and then the multinom function from nnet for the multinomial regression, but this gives me unreadable output. Here is an example using the nhanes2 dataset available with the mice package: library(mice) library(nnet) test <- mice(nhanes2, meth=c('sample','pmm','logreg','norm')) #age is categorical, bmi is continuous m <- with(test, multinom(age ~ bmi, model = T)

variable fillna() in each column

主宰稳场 submitted on 2019-12-23 21:30:08
Question: For starters, here is some artificial data fitting my problem: df = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)), columns = ["col_{}".format(x) for x in range(10)], index = range(0, vsize * 3, 3)) df_2 = pd.DataFrame(np.random.randint(0,100,size=(vsize, 10)), columns = ["col_{}".format(x) for x in range(10, 20, 1)], index = range(0, vsize * 2, 2)) df = df.merge(df_2, left_index = True, right_index = True, how = 'outer') df_tar = pd.DataFrame({"tar_1": [np.random.randint(0, 2) for
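The question is cut off before it states the exact fill rule, so the following is only a sketch of the general mechanism: `DataFrame.fillna` accepts a dict (or Series) keyed by column name, which lets each column receive a different fill value. The column names follow the question's `col_{}` pattern; the data and the chosen fill values are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col_0": [1.0, np.nan, 3.0],
    "col_1": [np.nan, 10.0, 20.0],
    "col_2": [5.0, 5.0, np.nan],
})

# One fill value per column: a constant, a mean, and a median (arbitrary choices).
fill_values = {
    "col_0": 0.0,
    "col_1": df["col_1"].mean(),
    "col_2": df["col_2"].median(),
}
filled = df.fillna(value=fill_values)
print(filled)
```

Columns missing from the dict are left untouched, so the mapping can cover only the columns that need special handling.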

Error when trying to use imputed data for sem using “mi” package

99封情书 submitted on 2019-12-23 05:15:47
Question: I am trying to fit a path model with my imputed data, but I can't figure out how to get my code to work. A regular regression like this works fine with the pool function: analysis <- pool(outcome1 ~ variable1 + variable2, data = imputations, m = NULL) But when I try to make it a path model, it gives me errors. An example of code I've tried: analysis <- pool(outcome1 + outcome2 ~ variable1 + variable2, data = imputations, m = NULL) Error in pool(outcome1 + outcome2 ~ variable1 + variable2, data =