imputation | 易学教程

Data imputation with fancyimpute and pandas

I have a large pandas data frame df. It has quite a few missing values. Dropping rows or columns is not an option. Imputing medians, means, or the most frequent values is not an option either (hence imputation with pandas and/or scikit-learn unfortunately doesn't do the trick). I came across what seems to be a neat package called fancyimpute (you can find it here). But I have some problems with it. Here is what I do:

    # the necessary imports
    import pandas as pd
    import numpy as np
    from fancyimpute import KNN

    # df is my data frame with the missing values. I keep only floats
    df_numeric = df.select_dtypes(include=
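To make the idea behind nearest-neighbour imputation concrete, here is a minimal standard-library sketch. It is not fancyimpute's implementation: the function name knn_impute, the None-for-missing convention, and the unweighted averaging of the k closest rows are all assumptions made for illustration.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows.

    Hypothetical helper: rows is a list of equal-length lists of floats,
    with None marking a missing value.
    """
    def distance(a, b):
        # Root mean squared difference over the columns observed in both rows.
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows where column j is observed
            # and that share at least one observed column with this row.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled


data = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.1],
    [9.0, 8.0, 7.0],
    [1.2, 2.2, 2.9],
]
result = knn_impute(data, k=2)
```

In this toy data the missing entry in the second row is filled from the first and fourth rows, which are its two closest neighbours; the distant third row is ignored. The input list is left unchanged.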

Replace all NA with FALSE in selected columns in R

Question: I have a question similar to this one, but my dataset is a bit bigger: 50 columns, with 1 column as UID and the other columns carrying either TRUE or NA. I want to change all the NA to FALSE, but I don't want to use an explicit loop. Can plyr do the trick? Thanks.

UPDATE #1 Thanks for the quick reply, but what if my dataset is like the one below:

    df <- data.frame(
      id = c(rep(1:19), NA),
      x1 = sample(c(NA, TRUE), 20, replace = TRUE),
      x2 = sample(c(NA, TRUE), 20, replace = TRUE)
    )

I only want x1 and x2 to be
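The requirement in the update (fill only the flag columns and leave the id column alone) can be sketched in Python as follows. The column-oriented dict, the helper name fill_false, and the use of None for NA are assumptions for illustration; this is not the R solution the asker was looking for.

```python
def fill_false(table, columns):
    """Return a copy of table with None replaced by False in the given columns only."""
    return {
        name: [False if (name in columns and v is None) else v for v in values]
        for name, values in table.items()
    }


table = {
    "id": [1, 2, None],
    "x1": [True, None, True],
    "x2": [None, None, True],
}
cleaned = fill_false(table, columns={"x1", "x2"})
```

The missing id stays missing because "id" is not in the set of columns to fill.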

How to replace NA (missing values) in a data frame with neighbouring values

Question:

    862 2006-05-19 6.241603 5.774208
    863 2006-05-20       NA       NA
    864 2006-05-21       NA       NA
    865 2006-05-22 6.383929 5.906426
    866 2006-05-23 6.782068 6.268758
    867 2006-05-24 6.534616 6.013767
    868 2006-05-25 6.370312 5.856366
    869 2006-05-26 6.225175 5.781617
    870 2006-05-27       NA       NA

I have a data frame x like the one above with some NAs, which I want to fill using neighbouring non-NA values. For example, the value for 2006-05-20 would be the average of the values for 2006-05-19 and 2006-05-22. How can I do this?

Answer 1: Properly formatted, your data looks like this:

    862 2006-05-19 6.241603 5.774208
    863 2006-05-20       NA       NA
    864 2006-05-21       NA       NA
    865 2006-05-22 6.383929 5.906426
    866 2006-05-23
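The rule described in the question (replace each NA with the average of the nearest non-NA values before and after it) might be sketched in Python like this. The function name fill_from_neighbours is hypothetical, and copying the single available neighbour for a leading or trailing NA is an assumption, since the question does not say how to treat the gap on 2006-05-27.

```python
def fill_from_neighbours(values):
    """Replace each None with the mean of the nearest non-None values on either side."""
    filled = list(values)
    n = len(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        # Nearest observed value before and after position i (None if absent).
        prev = next((values[j] for j in range(i - 1, -1, -1)
                     if values[j] is not None), None)
        nxt = next((values[j] for j in range(i + 1, n)
                    if values[j] is not None), None)
        neighbours = [x for x in (prev, nxt) if x is not None]
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled


# First value column from the question, with None standing in for NA.
series = [6.241603, None, None, 6.383929, 6.782068,
          6.534616, 6.370312, 6.225175, None]
filled = fill_from_neighbours(series)
```

Both gaps on 2006-05-20 and 2006-05-21 receive the same value, the mean of 6.241603 and 6.383929. This differs from linear interpolation, which would give the two days different values along the line between the neighbours.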

how to insert missing observations on a data frame

I have data consisting of observations over time. Unfortunately, some large gaps of time points are missing for one treatment. They are not coded as NA, and the gaps become apparent when I plot the data. My data frame looks like this. The number of samples per time point is irregular. (edit: sorry for not making the example reproducible)

    structure(list(A = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
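One common way to make such gaps explicit is to build the full sequence of expected time points and look each one up, leaving a missing marker where nothing was recorded. The sketch below assumes integer time points at a regular step and one value per time point, neither of which is confirmed by the truncated excerpt; insert_missing_times is a hypothetical name.

```python
def insert_missing_times(observations, step=1):
    """Return (time, value) pairs for every expected time point.

    observations maps an integer time to its value; time points with no
    observation appear in the result with value None.
    """
    times = sorted(observations)
    full = range(times[0], times[-1] + step, step)
    return [(t, observations.get(t)) for t in full]


observed = {0: 1.5, 1: 1.7, 4: 2.0}
completed = insert_missing_times(observed)
```

With the gaps coded as None, a plotting routine that breaks lines at missing values would no longer draw a misleading straight segment across the unobserved interval.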

R: replace NA with item from vector

Question: I am trying to replace some missing values in my data with the average values from a similar group. My data looks like this:

      X  Y
    1 x  y
    2 x  y
    3 NA y
    4 x  y

And I want it to look like this:

      X Y
    1 x y
    2 x y
    3 y y
    4 x y

I wrote this, and it worked:

    for(i in 1:nrow(data.frame)){
      if(is.na(data.frame$X[i]) == TRUE){
        data.frame$X[i] <- data.frame$Y[i]
      }
    }

But my data.frame is almost half a million lines long, and the for/if statements are pretty slow. What I want is something like is.na(data.frame$X) <
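The element-wise rule in the loop above (take X unless it is missing, otherwise take Y) can be written as a single pass over both columns. This Python sketch uses None for NA, and the helper name coalesce is an assumption borrowed from the SQL function of the same idea.

```python
def coalesce(primary, fallback):
    """Take each value from primary, falling back to the paired value when it is None."""
    return [f if p is None else p for p, f in zip(primary, fallback)]


X = ["x", "x", None, "x"]
Y = ["y", "y", "y", "y"]
X_filled = coalesce(X, Y)
```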

Impute categorical missing values in scikit-learn

I've got pandas data with some columns of text type. There are some NaN values in these text columns. What I'm trying to do is impute those NaNs with sklearn.preprocessing.Imputer (replacing NaN with the most frequent value). The problem is in the implementation. Suppose there is a pandas DataFrame df with 30 columns, 10 of which are categorical. Once I run:

    from sklearn.preprocessing import Imputer
    imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
    imp.fit(df)

Python raises an error: 'could not convert string to float: 'run1'', where 'run1' is an ordinary
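The error arises because that Imputer class only handled numeric input; it was later removed from scikit-learn in favour of sklearn.impute.SimpleImputer. Independently of any library, most-frequent imputation of a text column can be sketched with the standard library. The helper name impute_most_frequent and the use of None for NaN are assumptions for illustration.

```python
from collections import Counter


def impute_most_frequent(column):
    """Replace None entries with the most common observed value in the column."""
    observed = [v for v in column if v is not None]
    if not observed:
        # Nothing to learn a mode from; return the column unchanged.
        return list(column)
    # On ties, Counter.most_common keeps the value encountered first.
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]


runs = ["run1", None, "run2", "run1"]
runs_filled = impute_most_frequent(runs)
```

Applying such a function only to the categorical columns, and a numeric strategy to the rest, avoids the string-to-float conversion altogether.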

Replace missing values with mean - Spark Dataframe

I have a Spark DataFrame with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:

a) To do this for a single column (let's say Col A), this line of code seems to work:

    df.withColumn("new_Col",
      when($"ColA".isNull,
           df.select(mean("ColA")).first()(0).asInstanceOf[Double])
        .otherwise($"ColA"))

b) However, I have not been able to figure out how to do this for all the columns in my dataframe. I was trying
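The step from one column to all columns amounts to computing one mean per column and then filling each column with its own mean. The following is a plain-Python sketch of that logic, not Spark code; the column-oriented dict, the None-for-null convention, and the name fill_with_column_means are assumptions for illustration.

```python
def fill_with_column_means(table):
    """Replace None in every column with the mean of that column's observed values."""
    filled = {}
    for name, values in table.items():
        observed = [v for v in values if v is not None]
        # A column with no observed values has no mean; it is left as-is.
        mean = sum(observed) / len(observed) if observed else None
        filled[name] = [mean if v is None else v for v in values]
    return filled


table = {
    "ColA": [1.0, None, 3.0],
    "ColB": [None, 4.0, 6.0],
}
imputed = fill_with_column_means(table)
```

In a distributed setting the same two phases apply: aggregate all the means in a single pass over the data, then fill using the resulting column-to-mean mapping, rather than launching one aggregation per column.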

Replace missing values with column mean

I am not sure how to loop over each column to replace the NA values with the column mean. When I try to replace the values for one column using the following, it works well:

    Column1[is.na(Column1)] <- round(mean(Column1, na.rm = TRUE))

The code for looping over columns is not working:

    for(i in 1:ncol(data)){
      data[i][is.na(data[i])] <- round(mean(data[i], na.rm = TRUE))
    }

The values are not replaced. Can someone please help me with this?

A relatively simple modification of your code should solve the issue:

    for(i in 1:ncol(data)){
      data[is.na(data[,i]), i] <- mean(data[,i], na.rm = TRUE)
    }

If DF is