categorical-data | 易学教程

Generate “special” dictionary structure from just columns index in tsv

Question: Imagine a tab-separated file like this one:

    9606 1 GO:0002576 TAS - platelet degranulation - Process
    9606 1 GO:0003674 ND - molecular_function_z - Function
    9606 1 GO:0003674 OOO - molecular_function_z - Function
    9606 1 GO:0005576 IDA - extracellular region - Component
    9606 1 GO:0005576 TAS - extracellular region - Component
    9606 1 GO:0005576 OOO - extracellular region - Component
    9606 1 GO:0005615 HDA - extracellular spaces - Component
    9606 1 GO:0008150 ND - biological_processes - Process

Mapping pandas dataframe column to a dictionary

Question: I have a dataframe containing a categorical variable of high cardinality (many unique values). I would like to recode that variable so that the most frequent values are kept and all other values are replaced with a catch-all category ("others"). To give a simple example, here are the two values which should stay unchanged:

    top_values = ['apple', 'orange']

I established them based on their frequency in the following dataframe column:

    {'fruits': {0: 'apple', 1: 'apple', 2: 'orange
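The recoding described in this question can be sketched with `Series.where` and `Series.isin`. This is a minimal illustration, not the asker's or any answerer's code: the sample column below is hypothetical (the question's own data is truncated in this excerpt), and the output column name `fruits_recoded` is an assumption made for the example.

```python
import pandas as pd

# Values to keep, as given in the question.
top_values = ["apple", "orange"]

# Hypothetical sample data; the question's real column is cut off in the excerpt.
df = pd.DataFrame({"fruits": ["apple", "apple", "orange", "banana", "kiwi"]})

# Keep a value where it appears in top_values; otherwise substitute "others".
df["fruits_recoded"] = df["fruits"].where(df["fruits"].isin(top_values), "others")

print(df)
```

If the list of values to keep is not known in advance, one option is to derive it from the data, for example with `df["fruits"].value_counts().head(2).index`.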

Add horizontal lines in categorical scatter plot using ggplot2 in R

Question: I am trying to plot a simple scatter plot for 3 groups, with a different horizontal line (line segment) for each group: for instance, an hline at 3 for group "a", an hline at 2.5 for group "b" and an hline at 6 for group "c".

    library(ggplot2)
    df <- data.frame(tt = rep(c("a","b","c"),40), val = round(rnorm(120, m = rep(c(4, 5, 7), each = 40))))
    ggplot(df, aes(tt, val)) +
      geom_jitter(aes(tt, val), data = df, colour = I("red"), position = position_jitter(width = 0.05))

I really appreciate your help!

Find frequencies over 3rd quartile in table

Question: I have a big data frame (more than 239k observations of 57 variables) with sickness descriptions and the medicines administered for those sicknesses to people in different age ranges. I'd like to find the medicines in the top quartile of frequency of use for each sickness description. To make a reproducible example, I created a data frame of 1000 observations:

    set.seed(1)
    sk<-as.factor(sample(c("sick A","sick B","sick C","sick D"),1000,replace=T))
    md<-as.factor(sample(c("med 1","med 2","med 3","med 4",

R change categorical data to dummy variables

Question: I have a multivariate data frame and want to convert the categorical data in it to dummy variables. I used model.matrix, but it does not quite work. Please refer to the example below:

    age = c(1:15) #numeric
    sex = c(rep(0,7),rep(1,8)); sex = as.factor(sex) #factor
    bloodtype = c(rep('A',2),rep('B',8),rep('O',1),rep('AB',4)); bloodtype = as.factor(bloodtype) #factor
    bodyweight = c(11:25) #numeric
    wholedata = data.frame(cbind(age,sex,bloodtype,bodyweight))
    model.matrix(~.,data=wholedata)[,-1]

The

Breaking a continuous variable into categories using dplyr and/or cut

Question: I have a dataset that is a record of price changes, among other variables. I would like to mutate the price column into a categorical variable. I understand that the relevant tools in R seem to be dplyr and/or cut.

    > head(btc_data)
                     time  btc_price
    1 2017-08-27 22:50:00 4,389.6113
    2 2017-08-27 22:51:00 4,389.0850
    3 2017-08-27 22:52:00 4,388.8625
    4 2017-08-27 22:53:00 4,389.7888
    5 2017-08-27 22:56:00 4,389.9138
    6 2017-08-27 22:57:00 4,390.1663
    > dput(btc_data)
    ("4,972.0700", "4

Matplotlib cannot plot categorical values

Question: Here is my example:

    import matplotlib.pyplot as plt
    test_list = ['a', 'b', 'b', 'c']
    plt.hist(test_list)
    plt.show()

It generates the following error message:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-48-228f7f5e9d1e> in <module>()
          1 test_list = ['a', 'b', 'b', 'c']
    ----> 2 plt.hist(test_list)
          3 plt.show()
    C:\Anaconda3\lib\site-packages\matplotlib\pyplot.py in hist(x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label,
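One common way around this kind of error is to count the categories first and then draw a bar chart of the counts, rather than passing strings to `hist`. The sketch below takes that approach; it is not taken from the (truncated) question or its answers, and the helper name `plot_category_counts` is hypothetical.

```python
from collections import Counter

test_list = ["a", "b", "b", "c"]

# Count occurrences of each category; sorting keeps the bar order stable.
counts = Counter(test_list)
labels = sorted(counts)
heights = [counts[label] for label in labels]


def plot_category_counts(labels, heights):
    """Draw one bar per category (hypothetical helper, not from the question)."""
    # Imported here so that the counting step above runs without matplotlib.
    import matplotlib.pyplot as plt

    positions = list(range(len(labels)))
    fig, ax = plt.subplots()
    ax.bar(positions, heights)      # numeric x positions avoid string handling
    ax.set_xticks(positions)
    ax.set_xticklabels(labels)      # show the category names under the bars
    ax.set_ylabel("count")
    return fig
```

Calling `plot_category_counts(labels, heights)` returns a figure that can be saved with `fig.savefig(...)` or displayed with `matplotlib.pyplot.show()`.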

Categorising numerical and categorical variables into appropriate ranges in R

Question:

    Df <- bball5
    str(bball5)
    'data.frame': 379 obs. of 9 variables:
     $ ID    : int 238 239 240 241 242 243 244 245 246 247 ...
     $ Sex   : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
     $ Sport : Factor w/ 10 levels "BBall","Field",..: 1 1 1 1 1 1 1 1 1 1
     $ Ht    : num 196 190 178 185 185 ...
     $ Wt    : num 78.9 74.4 69.1 74.9 64.6 63.7 75.2 62.3 66.5 62.9 ...
     $ BMI   : num 20.6 20.7 21.9 21.9 19 ...
     $ BMIc  : NA NA NA NA NA NA NA NA NA NA ...
     $ Sex_f : Factor w/ 1 level "female": 1 1 1 1 1 1 1 1 1 1

R ifelse changed factor value into index

Question: I ran into a strange problem in R while using data.table. When I try to recode every Province with a count under 500 as "Other", the Provinces with the highest counts are replaced by index numbers in the output:

    df <- fact_data[,.N,Province][N >= 500]$Province
    df
    fact_data[,Province := ifelse(Province %in% df, fact_data$Province, "Other")]
    fact_data[,.N,Province][order(-N)]

Output:

However, this method works well on factor variables whose values are in numeric format. For example, instead of using

replace missing values in categorical data

Question: Let's suppose I have a column of categorical data with the values "red", "green" and "blue", plus some empty cells:

    red
    green
    red
    blue
    NaN

I'm sure that the NaN is one of red, green or blue. Should I replace the NaN with the average of the colors, or is that too strong an assumption? It would be:

    col1 | col2 | col3
    1      0      0
    0      1      0
    1      0      0
    0      0      1
    0.5    0.25   0.25

Or should I even scale the last row, keeping the ratio, so that these values have less influence? What is usually the best practice?

    0.25   0.125  0.125

Answer 1: It depends on what you want to do with
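The encoding the question describes, one-hot rows for observed values and the category frequencies for a missing value, can be reproduced in a few lines. This is only a sketch of the mechanics under stated assumptions: `None` stands in for the NaN cell, and the column order red, green, blue is assumed to match col1, col2, col3. It does not settle whether this imputation is appropriate, which, as the answer begins to say, depends on the use.

```python
from collections import Counter

# Column from the question; None stands in for the missing (NaN) cell.
colors = ["red", "green", "red", "blue", None]
categories = ["red", "green", "blue"]  # assumed order of col1, col2, col3

# Relative frequency of each category among the observed (non-missing) rows.
observed = [c for c in colors if c is not None]
counts = Counter(observed)
frequencies = [counts[cat] / len(observed) for cat in categories]

# One-hot encode observed rows; fill a missing row with the frequencies.
encoded = [
    list(frequencies) if c is None
    else [1.0 if c == cat else 0.0 for cat in categories]
    for c in colors
]

for row in encoded:
    print(row)
```

The down-weighted variant from the question corresponds to multiplying the imputed row by a constant factor (0.5 in the question's numbers).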