missing-data

R: variable exclusion from formula not working in presence of missing data

狂风中的少年 submitted on 2021-01-28 14:12:49
Question: I'm building a model in R while excluding the 'office' column in the formula (it sometimes contains hints of the class I predict). I'm learning on 'train' and predicting on 'test':

> model <- randomForest::randomForest(tc ~ . - office, data=train, importance=TRUE, proximity=TRUE)
> prediction <- predict(model, test, type = "class")

The prediction came back as all NAs:

> head(prediction)
[1] <NA> <NA> <NA> <NA> <NA> <NA>
Levels: 2668 2752 2921 3005

The reason is that test$office contains NAs: >

How to combine duplicate rows in pandas?

旧时模样 submitted on 2021-01-28 12:11:25
Question: How to combine duplicate rows in pandas, filling in missing values? In the example below, some rows have missing values in the c1 column, but the c2 column has duplicates that can be used as an index to look up and fill in those missing values. The input data looks like this:

      c1 c2
id
0   10.0  a
1    NaN  b
2   30.0  c
3   10.0  a
4   20.0  b
5    NaN  c

Desired output:

   c1 c2
0  10  a
1  20  b
2  30  c

But how to do this? Here is the code to generate the example data:

import pandas as pd
df = pd.DataFrame({ 'c1': [10
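One possible way to approach the question above, offered as a sketch rather than the accepted answer: group on c2 and take the first non-null c1 in each group. The example data is rebuilt from the table shown in the question (the original code is cut off), and the name `combined` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Example data rebuilt from the table in the question (assumption:
# the truncated original code produced the same frame).
df = pd.DataFrame({
    "c1": [10.0, np.nan, 30.0, 10.0, 20.0, np.nan],
    "c2": ["a", "b", "c", "a", "b", "c"],
})
df.index.name = "id"

# GroupBy.first() returns the first non-null value in each group,
# so rows with NaN in c1 are filled from their duplicates in c2.
combined = df.groupby("c2", as_index=False)["c1"].first()

# Restore the column order used in the desired output.
combined = combined[["c1", "c2"]]
print(combined)
```

This assumes every c2 value has at least one non-missing c1; a group with only NaN values would stay NaN.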

How to efficiently extrapolate missing data for multiple variables

最后都变了- submitted on 2021-01-27 15:13:00
Question: I have panel data in which numerous variables are missing observations before certain years. The years vary across variables. What is an efficient way to extrapolate the missing data points across multiple columns? I'm thinking of something as simple as extrapolation from a linear trend, but I'm hoping to find an efficient way to apply the prediction to multiple columns. Below is a sample data set with missingness similar to what I'm dealing with. In this example, I'm hoping to fill in the NA

Pandas read_csv, reading a boolean with missing values specified as an int

只愿长相守 submitted on 2021-01-27 07:10:32
Question: I am trying to import a CSV into a pandas DataFrame. I have boolean variables denoted with 1's and 0's, where missing values are identified with a -9. When I try to specify the dtype as boolean, I get a host of different errors, depending on what I try. Sample data:

test.csv
var1, var2
0, 0
0, 1
1, 3
-9, 0
0, 2
1, 7

I try to specify the dtype as I import:

dtype_dict = {'var1':'bool','var2':'int'}
nan_dict = {'var1':[-9]}
foo = pd.read_csv('test.csv',dtype=dtype_dict, na_values=nan_dict)

I get
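A hedged sketch of one workaround for the question above: NumPy's `bool` dtype cannot hold missing values, so the column is read without a forced dtype and then converted to pandas' nullable `"boolean"` dtype. The sample file is inlined via `io.StringIO`, and `skipinitialspace=True` is an assumption made to handle the spaces after the commas in the sample data.

```python
import io
import pandas as pd

# Inline copy of the sample test.csv from the question.
csv_text = """var1, var2
0, 0
0, 1
1, 3
-9, 0
0, 2
1, 7
"""

# Read without forcing a dtype; -9 in var1 is parsed as NaN,
# which makes var1 a float column of 0.0 / 1.0 / NaN.
foo = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,        # assumption: strip the space after each comma
    na_values={"var1": [-9]},
)

# Convert to the nullable boolean dtype, which can represent <NA>.
foo["var1"] = foo["var1"].astype("boolean")
print(foo.dtypes)
print(foo)
```

The conversion only succeeds when the non-missing values are 0 or 1, so it also acts as a light sanity check on the column.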

Pandas: Filling data for missing dates

北城以北 submitted on 2021-01-27 06:29:58
Question: Let's say I've got the following table:

ProdID  Date      Val1  Val2  Val3
Prod1   4/1/2019  1     3     4
Prod1   4/3/2019  2     3     54
Prod1   4/4/2019  3     4     54
Prod2   4/1/2019  1     3     3
Prod2   4/2/2019  1     3     4
Prod2   4/3/2019  2     4     4
Prod2   4/4/2019  2     5     3

The Prod2 entries are populated correctly, as we've got the data from 4/1/2019 to 4/4/2019. Prod1 has one missing date: 4/2/2019. I would like to find the missing dates for all ProdIDs and fill in Val1-3 with data copied from the previous entry. For instance, I would like to copy
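One way this could be done, sketched under a few assumptions: dates are month/day/year, every product should cover the overall date range found in the data, and missing rows take the previous entry's values within the same product. The names `all_dates`, `full_index` and `filled` are illustrative only.

```python
import pandas as pd

# Table from the question.
df = pd.DataFrame({
    "ProdID": ["Prod1"] * 3 + ["Prod2"] * 4,
    "Date": ["4/1/2019", "4/3/2019", "4/4/2019",
             "4/1/2019", "4/2/2019", "4/3/2019", "4/4/2019"],
    "Val1": [1, 2, 3, 1, 1, 2, 2],
    "Val2": [3, 3, 4, 3, 3, 4, 5],
    "Val3": [4, 54, 54, 3, 4, 4, 3],
})
# Assumption: dates are month/day/year.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Every (product, day) combination over the observed date range.
all_dates = pd.date_range(df["Date"].min(), df["Date"].max(), freq="D")
full_index = pd.MultiIndex.from_product(
    [df["ProdID"].unique(), all_dates], names=["ProdID", "Date"]
)

filled = (
    df.set_index(["ProdID", "Date"])
      .reindex(full_index)          # missing dates appear as NaN rows
      .groupby(level="ProdID")
      .ffill()                      # copy values from the previous entry per product
      .reset_index()
)
print(filled)
```

Because the reindex introduces NaN before the forward fill, the value columns come back as floats; they can be cast back to integers afterwards if no gaps remain.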

Random slope for time in subject not working in lme4

寵の児 submitted on 2021-01-20 18:39:09
Question: I cannot insert a random slope in this model with lme4 (1.1-7):

> difJS<-lmer(JS~Tempo+(Tempo|id),dat,na.action=na.omit)
Error: number of observations (=274) <= number of random effects (=278) for term (Tempo | id); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable

With nlme it works:

> JSprova<-lme(JS~Tempo,random=~1+Tempo|id,data=dat,na.action=na.omit)
> summary(JSprova)
Linear mixed-effects model fit by REML
Data: dat
AIC BIC

fill in missing data for group by unique ID [duplicate]

不羁岁月 submitted on 2021-01-05 07:06:44
Question: This question already has answers here: Filling missing value in group (3 answers). Closed 20 days ago.

My clinical data structure looks like this:

patientid <- c(100,100,100,101,101,101,102,102,102,104,104,104)
group <- c(1,1,NA,2,NA,NA,1,1,1,2,2,NA)
Data <- data.frame(patientid=patientid, group=group)

If there is missing data, then the NA should become the same value as the other group value for the same patient id. In other words, a patient is always in the same group and the missing data

Replace NA in a series of variables with different types of missing

爱⌒轻易说出口 submitted on 2021-01-01 09:20:31
Question: This is my data.

# A tibble: 10 x 6
      id main  s_0   s_1   s_2   s_3
   <dbl> <fct> <fct> <fct> <fct> <fct>
 1     1 5     75    A     4     110
 2     2 NA    NA    NA    NA    NA
 3     3 11    13    NA    7     769
 4     4 NA    NA    NA    NA    NA
 5     5 11    NA    NA    NA    835
 6     6 13    39    NA    4     NA
 7     7 NA    NA    NA    NA    NA
 8     8 19    42    D     6     654
 9     9 20    4     NA    7     577
10    10 NA    NA    NA    NA    NA

As you can see, the column main indicates whether the respondents in each row answered the questions in the other columns (s_0: s_4) or not. Ids 2, 4, 7 and 10 were not eligible for the rest; however, other participants can answer or miss (s

Filling Missing sales value with zero and calculate 3 month average in PySpark

我的未来我决定 submitted on 2020-12-26 04:31:33
Question: I want to add the missing rows with zero sales and calculate a 3-month average in PySpark.

My input:

product  specialty  date       sales
A        pharma     1/3/2019   50
A        pharma     1/4/2019   60
A        pharma     1/5/2019   70
A        pharma     1/8/2019   80
A        ENT        1/8/2019   50
A        ENT        1/9/2019   65
A        ENT        1/11/2019  40

My output:

product  specialty  date      sales  3month_avg_sales
A        pharma     1/3/2019  50     16.67
A        pharma     1/4/2019  60     36.67
A        pharma     1/5/2019  70     60
A        pharma     1/6/2019  0      43.33
A        pharma     1/7/2019  0      23.33
A        pharma     1/8/2019  80     26.67
A        ENT        1/8/2019  50     16

Replace dots in a float column with nan in Python

試著忘記壹切 submitted on 2020-12-25 18:15:54
Question: I have a data frame df like this:

df = pd.DataFrame([
    {'Name': 'Chris', 'Item Purchased': 'Sponge', 'Cost': 22.50},
    {'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': '.........'},
    {'Name': 'Filip', 'Item Purchased': 'Spoon', 'Cost': '...'}],
    index=['Store 1', 'Store 1', 'Store 2'])

I want to replace the missing values in the 'Cost' column with np.nan. So far I have tried:

df['Cost']=df['Cost'].str.replace("\.\.+", np.nan)

and

df['Cost']=re.sub('\.\.+',np.nan,df['Cost'])

but neither of
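A minimal sketch of one alternative to the regex attempts above, assuming the goal is a numeric 'Cost' column: `pd.to_numeric` with `errors="coerce"` turns anything that cannot be parsed as a number into NaN. This swaps in a different technique from the one the question tried, and it coerces every non-numeric entry, not only runs of dots.

```python
import pandas as pd

# Data frame from the question.
df = pd.DataFrame(
    [
        {"Name": "Chris", "Item Purchased": "Sponge", "Cost": 22.50},
        {"Name": "Kevyn", "Item Purchased": "Kitty Litter", "Cost": "........."},
        {"Name": "Filip", "Item Purchased": "Spoon", "Cost": "..."},
    ],
    index=["Store 1", "Store 1", "Store 2"],
)

# Values that cannot be parsed as numbers (here, the dot placeholders)
# become NaN, and the column ends up with a float dtype.
df["Cost"] = pd.to_numeric(df["Cost"], errors="coerce")
print(df)
```

If only dot-only strings should be treated as missing and other text must be preserved, a regex-based `Series.replace(..., regex=True)` would be the closer fit.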