imputation

Data imputation with fancyimpute and pandas

早过忘川 submitted on 2019-11-30 12:07:21
Question: I have a large pandas data frame df. It has quite a few missing values. Dropping rows or columns is not an option. Imputing medians, means, or the most frequent values is not an option either (hence imputation with pandas and/or scikit-learn unfortunately doesn't do the trick). I came across what seems to be a neat package called fancyimpute (you can find it here). But I have some problems with it. Here is what I do:

    # the necessary imports
    import pandas as pd
    import numpy as np
    from fancyimpute import KNN

Data imputation with fancyimpute and pandas

。_饼干妹妹 submitted on 2019-11-30 02:06:03
I have a large pandas data frame df. It has quite a few missing values. Dropping rows or columns is not an option. Imputing medians, means, or the most frequent values is not an option either (hence imputation with pandas and/or scikit-learn unfortunately doesn't do the trick). I came across what seems to be a neat package called fancyimpute (you can find it here). But I have some problems with it. Here is what I do:

    # the necessary imports
    import pandas as pd
    import numpy as np
    from fancyimpute import KNN

    # df is my data frame with the missing values. I keep only floats
    df_numeric = df.select_dtypes(include=
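The snippet above is cut off, so the exact problem is not visible here. As a minimal sketch of k-nearest-neighbour imputation on the float columns of a data frame, the example below swaps in scikit-learn's `KNNImputer` rather than fancyimpute's `KNN`. The toy data, the column names, and `n_neighbors=2` are assumptions made for illustration only. Because the imputer returns a plain NumPy array, the result is wrapped back into a DataFrame so that column names and the index survive.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame; the column names are made up for illustration.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [2.0, np.nan, 6.0, 8.0, 10.0],
    "label": ["u", "v", "w", "x", "y"],
})

# Keep only the float columns, as in the question.
df_numeric = df.select_dtypes(include=[np.float64])

# Impute each missing value from the 2 nearest rows (n_neighbors is an assumption).
imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(df_numeric)  # returns a NumPy array

# Restore column names and index, then write the result back next to the other columns.
df_filled = pd.DataFrame(filled, columns=df_numeric.columns, index=df_numeric.index)
df_out = df.copy()
df_out[df_numeric.columns] = df_filled
```

Non-numeric columns such as `label` are left untouched; only the selected float columns are imputed.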

Replace all NA with FALSE in selected columns in R

佐手、 submitted on 2019-11-29 11:19:44
Question: I have a question similar to this one, but my dataset is a bit bigger: 50 columns, with 1 column as UID and the other columns carrying either TRUE or NA. I want to change all the NA to FALSE, but I don't want to use an explicit loop. Can plyr do the trick? Thanks.

UPDATE #1: Thanks for the quick reply, but what if my dataset is like below:

    df <- data.frame(
      id = c(rep(1:19), NA),
      x1 = sample(c(NA, TRUE), 20, replace = TRUE),
      x2 = sample(c(NA, TRUE), 20, replace = TRUE)
    )

I only want X1 and X2 to be

How to replace NA (missing values) in a data frame with neighbouring values

血红的双手。 submitted on 2019-11-28 16:54:31

    862 2006-05-19 6.241603 5.774208
    863 2006-05-20 NA       NA
    864 2006-05-21 NA       NA
    865 2006-05-22 6.383929 5.906426
    866 2006-05-23 6.782068 6.268758
    867 2006-05-24 6.534616 6.013767
    868 2006-05-25 6.370312 5.856366
    869 2006-05-26 6.225175 5.781617
    870 2006-05-27 NA       NA

I have a data frame x like the one above with some NA values, which I want to fill using neighbouring non-NA values. For example, the value for 2006-05-20 would be the average of the 19th and the 22nd. How to do it is the question.

Properly formatted, your data looks like this:

    862 2006-05-19 6.241603 5.774208
    863 2006-05-20 NA       NA
    864 2006-05-21 NA       NA
    865 2006-05-22 6.383929 5.906426
    866 2006-05-23

how to insert missing observations on a data frame

柔情痞子 submitted on 2019-11-28 01:42:51
I have data that are observations over time. Unfortunately, some large gaps of time points are missing for one treatment. They are not coded as NA, and this becomes apparent if I plot them. My data frame looks like this. The number of samples per time point is irregular. (edit: sorry for not making the example reproducible)s structure(list(A = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,

R: replace NA with item from vector

霸气de小男生 submitted on 2019-11-27 14:07:37
Question: I am trying to replace some missing values in my data with the average values from a similar group. My data looks like this:

      X  Y
    1 x  y
    2 x  y
    3 NA y
    4 x  y

And I want it to look like this:

      X  Y
    1 x  y
    2 x  y
    3 y  y
    4 x  y

I wrote this, and it worked:

    for (i in 1:nrow(data.frame)) {
      if (is.na(data.frame$X[i]) == TRUE) {
        data.frame$X[i] <- data.frame$Y[i]
      }
    }

But my data.frame is almost half a million lines long, and the for/if statements are pretty slow. What I want is something like is.na(data.frame$X) <

How to replace NA (missing values) in a data frame with neighbouring values

会有一股神秘感。 submitted on 2019-11-27 10:10:16
Question:

    862 2006-05-19 6.241603 5.774208
    863 2006-05-20 NA       NA
    864 2006-05-21 NA       NA
    865 2006-05-22 6.383929 5.906426
    866 2006-05-23 6.782068 6.268758
    867 2006-05-24 6.534616 6.013767
    868 2006-05-25 6.370312 5.856366
    869 2006-05-26 6.225175 5.781617
    870 2006-05-27 NA       NA

I have a data frame x like the one above with some NA values, which I want to fill using neighbouring non-NA values. For example, the value for 2006-05-20 would be the average of the 19th and the 22nd. How to do it is the question.

Answer 1: Properly formatted, your data looks like this:

    862 2006-05-19

Impute categorical missing values in scikit-learn

会有一股神秘感。 submitted on 2019-11-27 10:09:38
I've got pandas data with some columns of text type. There are some NaN values in these text columns. What I'm trying to do is impute those NaNs with sklearn.preprocessing.Imputer (replacing NaN with the most frequent value). The problem is in the implementation. Suppose there is a pandas dataframe df with 30 columns, 10 of which are categorical. Once I run:

    from sklearn.preprocessing import Imputer
    imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
    imp.fit(df)

Python generates an error: 'could not convert string to float: 'run1'', where 'run1' is an ordinary
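One way to sidestep the string-to-float conversion is to do the most-frequent-value fill in pandas itself. The sketch below is not taken from the thread: the toy frame and its column names are hypothetical, and it assumes that every non-numeric column should be treated as categorical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one text column and one numeric column.
df = pd.DataFrame({
    "run": ["run1", "run2", np.nan, "run1"],
    "score": [1.0, np.nan, 3.0, 4.0],
})

# Treat every non-numeric column as categorical (an assumption for this sketch).
cat_cols = df.select_dtypes(exclude="number").columns

# mode() returns the most frequent value(s) per column; take the first row of that result.
modes = df[cat_cols].mode().iloc[0]

# Fill NaN in the categorical columns only; numeric columns are left as they are.
df[cat_cols] = df[cat_cols].fillna(modes)
```

When several values tie for most frequent, `mode()` returns them in sorted order, so `iloc[0]` picks the first of them.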

Replace missing values with mean - Spark Dataframe

别来无恙 submitted on 2019-11-27 09:05:09
I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:

a) To do this for a single column (let's say Col A), this line of code seems to work:

    df.withColumn("new_Col",
      when($"ColA".isNull, df.select(mean("ColA")).first()(0).asInstanceOf[Double])
        .otherwise($"ColA"))

b) However, I have not been able to figure out how to do this for all the columns in my dataframe. I was trying

Replace missing values with column mean

蓝咒 submitted on 2019-11-26 16:05:59
I am not sure how to loop over each column to replace the NA values with the column mean. When I try to replace the values for one column using the following, it works well:

    Column1[is.na(Column1)] <- round(mean(Column1, na.rm = TRUE))

The code for looping over columns is not working:

    for(i in 1:ncol(data)){
      data[i][is.na(data[i])] <- round(mean(data[i], na.rm = TRUE))
    }

The values are not replaced. Can someone please help me with this?

A relatively simple modification of your code should solve the issue:

    for(i in 1:ncol(data)){
      data[is.na(data[,i]), i] <- mean(data[,i], na.rm = TRUE)
    }

If DF is