categorical-data

Is there an advantage to ordering a categorical variable?

Posted 2019-12-02 01:20:29
I have been advised that it is best to order categorical variables where appropriate (e.g. short < medium < long). I am wondering: what is the specific advantage of treating a categorical variable as ordered, as opposed to simply categorical, in the context of modelling it as an explanatory variable? What does it mean mathematically (in lay terms, preferably)? Many thanks!

Among other things, it allows you to compare values from those factors:

> ord.fac <- ordered(c("small", "medium", "large"), levels=c("small", "medium", "large"))
> fac <- factor(c("small", "medium", "large"), levels=c("small", "medium", "large"))
…
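The same advantage exists in pandas: an ordered `Categorical` supports comparisons and order-aware reductions that a plain categorical does not. A minimal sketch with made-up size data:

```python
import pandas as pd

# An ordered categorical knows that small < medium < large
sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)
s = pd.Series(sizes)

# Comparisons and min/max respect the declared order, not alphabetical order
print((s < "large").tolist())  # [True, False, True, True]
print(s.min(), s.max())        # small large
```

With `ordered=False`, the same comparisons raise a `TypeError`, which is the practical difference the question is asking about.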

A Faster Way of Removing Unused Categories in Pandas?

Posted 2019-12-01 22:41:20
I'm running some models in Python, with data subset on categories. For memory usage and preprocessing, all the categorical variables are stored as the category data type. For each level of a categorical variable in my 'group by' column, I am running a regression, where I need to reset all my categorical variables to those that are present in that subset. I am currently doing this using .cat.remove_unused_categories(), which is taking nearly 50% of my total runtime. At the moment, the worst offender is my grouping column; the others are not taking as much time (I guess because there are not as many levels) …
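To make the behaviour concrete, here is a small sketch (toy data, not the asker's) of what remove_unused_categories() does on a subset, plus one alternative worth benchmarking: rebuilding the categorical from the values actually present.

```python
import pandas as pd

df = pd.DataFrame({
    "group": pd.Categorical(["a", "a", "b", "b", "c"]),
    "x": [1, 2, 3, 4, 5],
})

sub = df[df["group"] != "c"].copy()
# The subset still carries every original category:
print(sub["group"].cat.categories.tolist())   # ['a', 'b', 'c']

# remove_unused_categories() drops the levels no longer present
trimmed = sub["group"].cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())        # ['a', 'b']

# An alternative (assumption: faster in some cases, measure on your own
# data): rebuild the categorical from the observed values directly
rebuilt = pd.Categorical(sub["group"].astype(object))
print(sorted(rebuilt.categories.tolist()))    # ['a', 'b']
```

Which variant wins depends on pandas version and the number of levels, so profiling on the real 'group by' column is the only reliable test.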

Can sklearn DecisionTreeClassifier truly work with categorical data?

Posted 2019-12-01 17:41:47
While working with the DecisionTreeClassifier I visualized it using graphviz, and I have to say, to my astonishment, it seems to take categorical data and use it as continuous data. All my features are categorical, and for example you can see the following tree (please note that the first feature, X[0], has 6 possible values: 0, 1, 2, 3, 4, 5). From what I found here, the class uses a tree class which is a binary tree, so it is a limitation of sklearn. Does anyone know a way, that I am missing, to use the tree categorically? (I know it is not better for the task, but as I need categories, currently I …
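A common workaround (a sketch, not the only option) is to one-hot encode the categorical features first, so each binary split the tree makes becomes an "is category k or not" test rather than a numeric threshold over the category codes:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

# Toy data: one categorical feature with 6 integer-coded levels
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 1, 1, 0, 1])

# One-hot encode, then fit the tree on indicator columns
clf = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(random_state=0),
)
clf.fit(X, y)
print(clf.score(X, y))  # 1.0 on this tiny training set
```

This avoids splits like "X[0] <= 2.5" that impose an artificial ordering on the category codes, at the cost of a wider feature matrix.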

Matplotlib: how to plot categorical data on the y-axis?

Posted 2019-12-01 14:38:15
Let's say that I have the following code, which comes from here:

gender = ['male','male','female','male','female']

import matplotlib.pyplot as plt
from collections import Counter

c = Counter(gender)
men = c['male']
women = c['female']
bar_heights = (men, women)
x = (1, 2)

fig, ax = plt.subplots()
width = 0.4
ax.bar(x, bar_heights, width)
ax.set_xlim((0, 3))
ax.set_ylim((0, max(men, women)*1.1))
ax.set_xticks([i+width/2 for i in x])
ax.set_xticklabels(['male', 'female'])
plt.show()

How could the categories male and female be plotted on the y-axis, as opposed to the x-axis? Perhaps you're …
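One straightforward answer (a sketch using the same toy data) is to switch from `ax.bar` to `ax.barh`, which draws horizontal bars and therefore puts the categories on the y-axis:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs in a script
import matplotlib.pyplot as plt
from collections import Counter

gender = ['male', 'male', 'female', 'male', 'female']
c = Counter(gender)
labels = ['male', 'female']
counts = [c[label] for label in labels]

fig, ax = plt.subplots()
y = range(len(labels))
ax.barh(y, counts, height=0.4)  # horizontal bars: categories on the y-axis
ax.set_yticks(list(y))
ax.set_yticklabels(labels)
ax.set_xlabel('count')
fig.savefig('gender_counts.png')
```

The `width` argument of `bar` becomes `height` in `barh`, and the tick setup moves from `set_xticks`/`set_xticklabels` to their y-axis counterparts.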

predict.glm() with three new categories in the test data (r)(error)

Posted 2019-12-01 14:13:34
I have a data set called data which has 481,092 rows. I split data into two equal halves: the first half (rows 1–240,546) is called train and was used for the glm(); the second half (rows 240,547–481,092) is called test and should be used to validate the model. Then I started the regression:

testreg <- glm(train$returnShipment ~ train$size + train$color + train$price + train$manufacturerID + train$salutation + train$state + train$age + train$deliverytime, family=binomial(link="logit"), data=train)

Now the prediction:

prediction <- predict.glm(testreg, newdata=test, type="response")

gives …
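The underlying problem is general, not R-specific: a model fit on the training levels cannot score rows whose category value never appeared in training. A sketch of one workaround in pandas (hypothetical data), filtering the test set down to the levels the model has seen:

```python
import pandas as pd

train = pd.DataFrame({"state": ["CA", "NY", "CA"], "y": [0, 1, 0]})
test = pd.DataFrame({"state": ["NY", "TX", "CA"], "y": [1, 0, 1]})

# Keep only test rows whose category level occurred in training;
# "TX" was never seen, so its row is dropped before scoring
seen = set(train["state"])
scorable = test[test["state"].isin(seen)]
print(scorable["state"].tolist())  # ['NY', 'CA']
```

Whether dropping, pooling into an "other" level, or refitting with the combined levels is appropriate depends on how common the unseen categories are.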

How can I one hot encode multiple variables with big data in R?

Posted 2019-12-01 06:09:32
Question: I currently have a dataframe with 260,000 rows and 50 columns, where 3 columns are numeric and the rest are categorical. I want to one-hot encode the categorical columns in order to perform PCA and use regression to predict the class. How can I accomplish the example below in R?

Example: V1 V2 V3 V4 V5 .... VN-1 VN to V1_a V1_b V2_a V2_b V2_c V3_a V3_b and so on

Answer 1: You can use model.matrix or sparse.model.matrix. Something like this:

sparse.model.matrix(~ . - 1, data = your_data)
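For comparison, the equivalent expansion in pandas (a sketch with a tiny made-up frame) is `get_dummies`, which also supports a sparse output for wide results:

```python
import pandas as pd

df = pd.DataFrame({
    "num": [1.0, 2.0, 3.0],   # numeric column, left untouched
    "V1": ["a", "b", "a"],    # categorical columns to expand
    "V2": ["x", "x", "y"],
})

# sparse=True stores the indicator columns as sparse arrays,
# which matters when many levels produce a wide, mostly-zero matrix
encoded = pd.get_dummies(df, columns=["V1", "V2"], sparse=True)
print(list(encoded.columns))
# ['num', 'V1_a', 'V1_b', 'V2_x', 'V2_y']
```

The `V1_a`, `V1_b`, … naming mirrors the example layout in the question.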

pd.get_dummies() slow on large levels

Posted 2019-12-01 05:05:26
Question: I'm unsure if this is already the fastest possible method, or if I'm doing this inefficiently. I want to one-hot encode a particular categorical column which has 27k+ possible levels. The column has different values in 2 different datasets, so I combined the levels first before using get_dummies():

def hot_encode_column_in_both_datasets(column_name, df, df2, sparse=True):
    col1b = set(df2[column_name].unique())
    col1a = set(df[column_name].unique())
    combined_cats = list(col1a.union(col1b))
    df[column …
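The snippet above is cut off by the scrape, so here is a sketch of the same combine-then-encode idea under an assumed helper name: give both frames the same category set, then one-hot encode each, so the resulting dummy columns line up across datasets.

```python
import pandas as pd

def hot_encode_shared(column_name, df, df2, sparse=True):
    # Union of levels from both frames, in a deterministic order
    combined = sorted(set(df[column_name]) | set(df2[column_name]))
    for frame in (df, df2):
        frame[column_name] = pd.Categorical(frame[column_name],
                                            categories=combined)
    # get_dummies emits one column per *category*, used or not,
    # so both frames end up with identical columns
    return (pd.get_dummies(df, columns=[column_name], sparse=sparse),
            pd.get_dummies(df2, columns=[column_name], sparse=sparse))

a = pd.DataFrame({"city": ["rome", "oslo"]})
b = pd.DataFrame({"city": ["oslo", "lima"]})
ea, eb = hot_encode_shared("city", a, b)
print(list(ea.columns))                      # ['city_lima', 'city_oslo', 'city_rome']
print(list(eb.columns) == list(ea.columns))  # True
```

With 27k+ levels, `sparse=True` keeps the memory cost proportional to the number of nonzero entries rather than rows × levels.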