categorical-data

Generate “special” dictionary structure from just columns index in tsv

徘徊边缘 submitted on 2019-12-12 19:09:35
Question: Imagine a tab-separated file like this one:

    9606  1  GO:0002576  TAS  -  platelet degranulation  -  Process
    9606  1  GO:0003674  ND   -  molecular_function_z    -  Function
    9606  1  GO:0003674  OOO  -  molecular_function_z    -  Function
    9606  1  GO:0005576  IDA  -  extracellular region    -  Component
    9606  1  GO:0005576  TAS  -  extracellular region    -  Component
    9606  1  GO:0005576  OOO  -  extracellular region    -  Component
    9606  1  GO:0005615  HDA  -  extracellular spaces    -  Component
    9606  1  GO:0008150  ND   -  biological_processes    -  Process
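The question is cut off before it describes the target dictionary, so the following is only a minimal sketch under an assumption: that the goal is to group one column's values under another column's values, with both columns selected by index. The names `SAMPLE`, `group_by_columns`, `key_index` and `value_index` are hypothetical and do not come from the original post.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample: the first three rows of the file shown in the question.
SAMPLE = (
    "9606\t1\tGO:0002576\tTAS\t-\tplatelet degranulation\t-\tProcess\n"
    "9606\t1\tGO:0003674\tND\t-\tmolecular_function_z\t-\tFunction\n"
    "9606\t1\tGO:0003674\tOOO\t-\tmolecular_function_z\t-\tFunction\n"
)


def group_by_columns(handle, key_index, value_index):
    """Map each value in column key_index to the list of values
    found in column value_index on the same rows (assumed structure)."""
    grouped = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        grouped[row[key_index]].append(row[value_index])
    return dict(grouped)


# Group evidence codes (column 3) by GO identifier (column 2).
result = group_by_columns(io.StringIO(SAMPLE), key_index=2, value_index=3)
print(result)
```

For a real file, an open file object (for example `open(path, newline="")`) could be passed in place of the `io.StringIO` wrapper.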

Mapping pandas dataframe column to a dictionary

痴心易碎 submitted on 2019-12-12 13:43:40
Question: I have a dataframe containing a high-cardinality categorical variable (many unique values). I would like to recode that variable to a small set of values (the most frequent ones) and replace all other values with a catch-all category ("others"). To give a simple example, here are the two values that should stay unchanged:

    top_values = ['apple', 'orange']

I established them based on their frequency in the following dataframe column:

    {'fruits': {0: 'apple', 1: 'apple', 2: 'orange
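The recoding described above can be sketched with `Series.where`, which keeps values where a condition holds and substitutes a replacement elsewhere. The sample column below is hypothetical (the dictionary in the question is truncated), and `fruits_recoded` is an illustrative column name, not one from the original post.

```python
import pandas as pd

# Hypothetical data; the original example column is cut off in the snippet.
df = pd.DataFrame({"fruits": ["apple", "apple", "orange", "pear", "kiwi"]})
top_values = ["apple", "orange"]

# Keep values listed in top_values; replace everything else with "others".
df["fruits_recoded"] = df["fruits"].where(df["fruits"].isin(top_values), "others")
print(df)
```

If the top values are not known in advance, something like `df["fruits"].value_counts().nlargest(2).index` could supply them, assuming "top" means most frequent.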

Add horizontal lines in categorical scatter plot using ggplot2 in R

我的梦境 submitted on 2019-12-12 13:24:47
Question: I am trying to draw a simple scatter plot for 3 groups, with a different horizontal line (line segment) for each group: for instance, an hline at 3 for group "a", an hline at 2.5 for group "b" and an hline at 6 for group "c".

    library(ggplot2)
    df <- data.frame(tt = rep(c("a","b","c"),40), val = round(rnorm(120, m = rep(c(4, 5, 7), each = 40))))
    ggplot(df, aes(tt, val)) +
      geom_jitter(aes(tt, val), data = df, colour = I("red"), position = position_jitter(width = 0.05))

I really appreciate your help!

Find frequencies over 3rd quartile in table

帅比萌擦擦* submitted on 2019-12-12 12:33:44
Question: I have a big data frame (more than 239k observations of 57 variables) with sickness descriptions and the medicines administered for those sicknesses to people in different age ranges. I'd like to find the medicines in the top quartile of usage frequency for each sickness description. To make a reproducible example, I created a data frame of 1000 observations:

    set.seed(1)
    sk<-as.factor(sample(c("sick A","sick B","sick C","sick D"),1000,replace=T))
    md<-as.factor(sample(c("med 1","med 2","med 3","med 4",

R change categorical data to dummy variables

纵然是瞬间 submitted on 2019-12-12 04:58:00
Question: I have a multivariate data frame and want to convert the categorical columns in it to dummy variables. I used model.matrix, but it does not quite work. Please refer to the example below:

    age = c(1:15) #numeric
    sex = c(rep(0,7),rep(1,8)); sex = as.factor(sex) #factor
    bloodtype = c(rep('A',2),rep('B',8),rep('O',1),rep('AB',4)); bloodtype = as.factor(bloodtype) #factor
    bodyweight = c(11:25) #numeric
    wholedata = data.frame(cbind(age,sex,bloodtype,bodyweight))
    model.matrix(~.,data=wholedata)[,-1]

The

Breaking a continuous variable into categories using dplyr and/or cut

三世轮回 submitted on 2019-12-12 04:54:23
Question: I have a dataset that records price changes, among other variables. I would like to mutate the price column into a categorical variable. I understand that the relevant tools in R seem to be the dplyr package and/or the cut function.

    > head(btc_data)
                     time  btc_price
    1 2017-08-27 22:50:00 4,389.6113
    2 2017-08-27 22:51:00 4,389.0850
    3 2017-08-27 22:52:00 4,388.8625
    4 2017-08-27 22:53:00 4,389.7888
    5 2017-08-27 22:56:00 4,389.9138
    6 2017-08-27 22:57:00 4,390.1663
    >dput(btc_data) ("4,972.0700", "4

Matplotlib cannot plot categorical values

断了今生、忘了曾经 submitted on 2019-12-12 04:17:08
Question: Here is my example:

    import matplotlib.pyplot as plt
    test_list = ['a', 'b', 'b', 'c']
    plt.hist(test_list)
    plt.show()

It generates the following error message:

    TypeError Traceback (most recent call last)
    <ipython-input-48-228f7f5e9d1e> in <module>()
          1 test_list = ['a', 'b', 'b', 'c']
    ----> 2 plt.hist(test_list)
          3 plt.show()
    C:\Anaconda3\lib\site-packages\matplotlib\pyplot.py in hist(x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label,
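One way to sidestep this error without relying on `hist` accepting strings is to count the categories first and draw the counts as bars at integer positions. This is a sketch of that workaround, not necessarily the answer given in the original thread; the helper names `category_counts` and `plot_category_counts` are hypothetical.

```python
from collections import Counter


def category_counts(values):
    """Return sorted category labels and their occurrence counts."""
    counts = Counter(values)
    labels = sorted(counts)
    return labels, [counts[label] for label in labels]


def plot_category_counts(values):
    """Draw one bar per category; bars sit at integer x positions."""
    # Imported here so the counting helper works without matplotlib installed.
    import matplotlib.pyplot as plt

    labels, heights = category_counts(values)
    fig, ax = plt.subplots()
    ax.bar(range(len(labels)), heights, tick_label=labels)
    ax.set_ylabel("count")
    return fig, ax


test_list = ['a', 'b', 'b', 'c']
labels, heights = category_counts(test_list)
```

Calling `plot_category_counts(test_list)` followed by `plt.show()` (after importing `matplotlib.pyplot as plt`) would display the chart.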

Categorising numerical and categorical variables into appropriate ranges in R

梦想与她 submitted on 2019-12-12 04:16:10
Question:

    Df <- bball5
    str(bball5)
    'data.frame': 379 obs. of 9 variables:
     $ ID    : int 238 239 240 241 242 243 244 245 246 247 ...
     $ Sex   : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
     $ Sport : Factor w/ 10 levels "BBall","Field",..: 1 1 1 1 1 1 1 1 1 1
     $ Ht    : num 196 190 178 185 185 ...
     $ Wt    : num 78.9 74.4 69.1 74.9 64.6 63.7 75.2 62.3 66.5 62.9 ...
     $ BMI   : num 20.6 20.7 21.9 21.9 19 ...
     $ BMIc  : NA NA NA NA NA NA NA NA NA NA ...
     $ Sex_f : Factor w/ 1 level "female": 1 1 1 1 1 1 1 1 1 1

R ifelse changed factor value into index

家住魔仙堡 submitted on 2019-12-12 01:39:26
Question: I ran into a strange problem in R while using data.table. When I try to recode every Province with a count under 500 as "Other", the output turns the remaining high-count provinces into index numbers:

    df <- fact_data[,.N,Province][N >= 500]$Province
    df
    fact_data[,Province := ifelse(Province %in% df, fact_data$Province, "Other")]
    fact_data[,.N,Province][order(-N)]

Output: But this method worked well on factor variables whose values are in numeric format. For example, instead of using

replace missing values in categorical data

可紊 submitted on 2019-12-11 17:28:21
Question: Let's suppose I have a column with the categorical values "red", "green" and "blue", plus empty cells:

    red
    green
    red
    blue
    NaN

I'm sure that the NaN is one of red, green or blue. Should I replace the NaN with the average of the colors, or is that too strong an assumption? It would be:

    col1 | col2 | col3
    1    | 0    | 0
    0    | 1    | 0
    1    | 0    | 0
    0    | 0    | 1
    0.5  | 0.25 | 0.25

Or even scale the last row while keeping the ratio, so that these values have less influence? What is the usual best practice?

    0.25 | 0.125 | 0.125

Answer 1: It depends on what you want to do with
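The frequency-based fill proposed in the question can be reproduced with a few lines of plain Python. This sketch only illustrates the arithmetic behind the 0.5 / 0.25 / 0.25 row; it is not a recommendation and not taken from the truncated answer. The variable names are hypothetical, and `None` stands in for the NaN cell.

```python
from collections import Counter

# Column from the question; None represents the missing (NaN) cell.
colors = ["red", "green", "red", "blue", None]
categories = ["red", "green", "blue"]  # col1, col2, col3 in the question

# Relative frequency of each category among the observed values.
observed = [c for c in colors if c is not None]
counts = Counter(observed)
freqs = [counts[cat] / len(observed) for cat in categories]

# One-hot encode observed rows; fill missing rows with the frequencies.
encoded = [
    list(freqs) if c is None else [1.0 if c == cat else 0.0 for cat in categories]
    for c in colors
]

# Optional down-weighting that keeps the ratio, as the question suggests.
scaled_missing_row = [0.5 * f for f in freqs]
```

The factor 0.5 in the last line matches the scaled row in the question; any other weight would keep the same ratio.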