categorical-data | 易学教程

Handling NULL values in Spark StringIndexer

Question: I have a dataset with some categorical string columns that I want to represent as doubles. I used StringIndexer for this conversion and it works, but when I tried it on another dataset that has NULL values, it failed with a java.lang.NullPointerException. For better understanding, here is my code:

    for(col <- cols){
      out_name = col ++ "_"
      var indexer = new StringIndexer().setInputCol(col).setOutputCol(out_name)
      var indexed = indexer.fit(df).transform(df)
      df = (indexed
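
The NULL-handling idea behind this question can be sketched outside Spark. The following is a minimal plain-Python illustration, not Spark's implementation: it assigns indices by descending label frequency and sends missing values to one extra bucket at index `len(labels)`, which is roughly what the `handleInvalid = "keep"` option of StringIndexer is documented to do in Spark versions that support it for NULLs. The helper names `fit_labels` and `transform` and the sample column are hypothetical.

```python
from collections import Counter


def fit_labels(values):
    """Order distinct non-null labels by descending frequency, ties alphabetically."""
    counts = Counter(v for v in values if v is not None)
    return sorted(counts, key=lambda label: (-counts[label], label))


def transform(values, labels):
    """Map each value to its label index; None goes to one extra bucket."""
    index = {label: float(i) for i, label in enumerate(labels)}
    null_bucket = float(len(labels))  # assumption: nulls are kept, not dropped
    return [index[v] if v is not None else null_bucket for v in values]


# Hypothetical column containing two missing values.
column = ["a", "b", None, "a", "c", "a", None, "b"]
labels = fit_labels(column)
indexed = transform(column, labels)
print(labels)
print(indexed)
```

Keeping nulls in their own bucket preserves the row count; dropping or imputing them first are alternative choices with different trade-offs.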

Automatically use LRT to assess significance of entire factor variable

Question: R's output for a multivariable regression model that includes one or more factor variables does not automatically include a likelihood ratio test (LRT) of the significance of the entire factor variable in the model. For example:

    fake = data.frame(
      x1=rnorm(100),
      x2=sample(LETTERS[1:4], size=100, replace=TRUE),
      y=rnorm(100)
    )
    head(fake)
              x1 x2          y
    1  0.6152511  A  0.7682467
    2 -0.8215727  A -0.5389245
    3 -1.3287208  A -0.1797851
    4  0.5837217  D  0.9509888
    5 -0.2828024  C -0.9829126
    6  0.3971358  B -0.4895091
    m =

Tensorflow embedding for categorical feature

Question: In machine learning, it is common to represent a categorical (specifically, nominal) feature with one-hot encoding. I am trying to learn how to use TensorFlow's embedding layer to represent a categorical feature in a classification problem. I have TensorFlow version 1.01 installed and I am using Python 3.6. I am aware of the TensorFlow tutorial for word2vec, but it is not very instructive for my case. While building the tf.Graph, it uses NCE-specific weights and tf.nn.nce_loss. I just
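
The relationship between one-hot encoding and an embedding can be shown without TensorFlow. The sketch below is a plain-Python illustration under stated assumptions, not the TensorFlow API: multiplying a one-hot vector by a weight matrix selects a single row of that matrix, which is the same result a row lookup gives (the operation that functions such as `tf.nn.embedding_lookup` perform). The helper names, the matrix sizes, and the random weights are hypothetical.

```python
import random


def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    n_cols = len(matrix[0])
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(n_cols)
    ]


def embedding_lookup(index, matrix):
    """Select the embedding row for a category index directly."""
    return list(matrix[index])


random.seed(0)
num_categories, embedding_dim = 4, 3  # hypothetical sizes
weights = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(num_categories)
]

category = 2
dense = vector_times_matrix(one_hot(category, num_categories), weights)
looked_up = embedding_lookup(category, weights)
print(dense)
print(looked_up)
```

The lookup avoids building the one-hot vector at all, which matters when the number of categories is large; in a trained model the rows of `weights` would be learned rather than random.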

How to build a raster legend directly from the raster attribute table and display the legend only for the classes displayed in the raster?

Question: I would like to use the raster attribute table information to create the legend of a raster, such as raster 1, and display the legend only for the classes displayed in the raster. I built an example to explain what I would like to get.

1/ Build the raster

    r <- raster(ncol=10, nrow=10)
    values(r) <- sample(1:3, ncell(r), replace=T)

2/ Add the Raster Attribute Table

    r <- ratify(r) # build the Raster Attribute Table
    rat <- levels(r)[[1]] # get the values of the unique cells from the attribute table
    rat

How do I make a boxplot with two categorical variables in R? [closed]

Question: I would like to make a boxplot that shows how time spent doing a behaviour (Alert) is affected by two variables (Period = Morning/Afternoon and Visitor Level = High/Low).

    Alert ~ Period + Vis.Level

'Alert' is a set of 12 numbers that show the amount of time spent awake with the other two as the

“Automatically” calculate linear combination of parameter estimates with PROC GLM

Question: Background: I have a categorical variable, X, with four levels that I fit as separate dummy variables. Thus, there are three dummy variables in total, representing x=1, x=2, x=3 (x=0 is the baseline). Problem/issue: I want to be able to calculate the value of a linear combination (i.e. using SAS as a calculator) of these dummy variables. For example, 2*B1 + 2*B2 + B3. In Stata, this can be done using the lincom command, which uses the stored beta estimates to calculate linear combinations of the
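
The arithmetic behind such a linear combination can be sketched independently of SAS or Stata. The following plain-Python example is an illustration only: for a weight vector c, the point estimate is c'b and its standard error is sqrt(c'Vc), where b holds the coefficient estimates and V is their covariance matrix. The function name `linear_combination` is hypothetical, and the estimates and covariance values are made-up placeholders, not output from any fitted model.

```python
import math


def linear_combination(weights, estimates, covariance):
    """Return the estimate c'b and standard error sqrt(c'Vc)."""
    n = len(weights)
    estimate = sum(w * b for w, b in zip(weights, estimates))
    variance = sum(
        weights[i] * covariance[i][j] * weights[j]
        for i in range(n)
        for j in range(n)
    )
    return estimate, math.sqrt(variance)


# Hypothetical coefficient estimates B1, B2, B3 and their covariance matrix.
betas = [0.5, 1.0, -0.25]
cov = [
    [0.04, 0.01, 0.00],
    [0.01, 0.09, 0.02],
    [0.00, 0.02, 0.16],
]

# Weights for the combination 2*B1 + 2*B2 + B3 from the question.
est, se = linear_combination([2, 2, 1], betas, cov)
print(est, se)
```

The off-diagonal covariance terms matter here: ignoring them and summing only the weighted variances would generally give the wrong standard error.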

Automatically compare nested models from mice's glm.mids

Question: I have a multiply-imputed model from R's mice package in which there are lots of factor variables. For example:

    library(mice)
    library(Hmisc)
    # turn all the variables into factors
    fake = nhanes
    fake$age = as.factor(nhanes$age)
    fake$bmi = cut2(nhanes$bmi, g=3)
    fake$chl = cut2(nhanes$chl, g=3)
    head(fake)
      age         bmi hyp       chl
    1   1        <NA>  NA      <NA>
    2   2 [20.4,25.5)   1 [187,206)
    3   1        <NA>   1 [187,206)
    4   3        <NA>  NA      <NA>
    5   1 [20.4,25.5)   1 [113,187)
    6   3        <NA>  NA [113,187)
    imput = mice(nhanes)
    # big model
    fit1 = glm

Plotting two categorical arrays in a histogram/bar chart?

Question: I have a categorical array, race, and an array of yes/no responses, and I want to create a stacked bar/histogram plot in which each race has its own bar and each bar is split into two colors: one for the respondents who said yes and one for those who said no. Is there a way to do this relatively simply in MATLAB? And is there a way to at least create a table that shows, for each race, how many said yes and how many said no? To clarify, there are 1250 rows in my data set,
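
The counting step described in this question (how many yes and no answers per race) is a cross-tabulation. The sketch below shows that step in plain Python rather than MATLAB, purely as an illustration of the logic; the function name `cross_tabulate` and the six sample rows are hypothetical and stand in for the 1250-row data set.

```python
from collections import Counter


def cross_tabulate(groups, answers):
    """Count yes/no answers per group; absent combinations count as 0."""
    counts = Counter(zip(groups, answers))
    return {
        group: {"yes": counts[(group, "yes")], "no": counts[(group, "no")]}
        for group in sorted(set(groups))
    }


# Hypothetical sample data.
race = ["A", "B", "A", "C", "B", "A"]
response = ["yes", "no", "no", "yes", "yes", "yes"]

table = cross_tabulate(race, response)
for group, row in table.items():
    print(group, row["yes"], row["no"])
```

Each row of the resulting table supplies the two segment heights for one stacked bar, so the same counts can drive both the table and the plot.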

Lexical dispersion plot in seaborn

Question: I am using the seaborn module to produce a plot similar to the example below.

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns
    location = "/global/scratch/umalmonj/WRF/juris/golden_hourly_manual_obs.csv"
    df = pd.read_csv(location,usecols= ["Year","Month","Day","Time","Weather"],parse_dates=[["Year","Month","Day","Time"]])

I have a df that looks like:

      Year_Month_Day_Time Weather
    0 2010-01-01 00:00:00     NaN
    1 2010-01-01 01:00:00     NaN
    2 2010-01-01 02:00

Meaning of “trait” in MCMCglmm

Question: As in this post, I'm struggling with the notation of MCMCglmm, especially what is meant by trait. My code is the following:

    library("MCMCglmm")
    set.seed(123)
    y <- sample(letters[1:3], size = 100, replace = TRUE)
    x <- rnorm(100)
    id <- rep(1:10, each = 10)
    dat <- data.frame(y, x, id)
    mod <- MCMCglmm(fixed = y ~ x, random = ~us(x):id, data = dat, family = "categorical")

This gives me the error message: For error structures involving catgeorical data with more than 2 categories pleasue use