categorical-data | 易学教程

How to deal with co-linearity of dummy variables for linear regression?

Question: I am using scikit-learn LogisticRegression on a dataset of household characteristics and trying to understand how to prepare the independent variables. I have created binary dummy variables in place of categorical variables. For example, the variable DWELLING_TYPE, which had 3 possible values (DetachedHouse, SemiDetached and Apartment), has been replaced with 3 binary variables, DWELLING_TYPE_DetachedHouse, DWELLING_TYPE_SemiDetached and DWELLING_TYPE_Apartment, that each have the value 1 or 0. Clearly
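The co-linearity the question points at can be seen in a small sketch. It assumes pandas is available and uses a made-up four-row frame; only the DWELLING_TYPE name and its three values come from the question. With all three dummies, every row sums to 1, so any one column is determined by the other two. Dropping one level (here via `drop_first=True`) is one common way to remove that redundancy, not necessarily the approach the original answers recommend.

```python
import pandas as pd

# Hypothetical rows; only the column name and its three values come from the question.
df = pd.DataFrame(
    {"DWELLING_TYPE": ["DetachedHouse", "SemiDetached", "Apartment", "Apartment"]}
)

# Full dummy coding: one 0/1 column per level.
full = pd.get_dummies(df, columns=["DWELLING_TYPE"], dtype=int)

# Every row has exactly one 1, so the three columns always sum to 1.
row_sums = full.sum(axis=1)

# Reduced coding: the first level in sorted order (Apartment) is dropped
# and becomes the implicit reference category.
reduced = pd.get_dummies(df, columns=["DWELLING_TYPE"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

In the reduced frame, a row with 0 in both remaining columns stands for the dropped level.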

plt.plot issue in pandas with categorical index DataFrame

Question: I have a DataFrame with categorical index like so: import pandas as pd import matplotlib.pyplot as plt %matplotlib notebook accidents_by_day=pd.DataFrame({'num_accidents':[5659,5298,4917,4461,4181,4038,3985], 'weekday':[7,1,6,5,4,2,3]}) weekday_map={1:'Sunday',2:'Monday',3:'Tuesday',4:'Wednesday',5:'Thursday',6:'Friday',7:'Saturday'} new_index=(pd.CategoricalIndex(accidents_by_day.weekday.map(weekday_map)).reorder_categories(new_categories=['Monday','Tuesday','Wednesday','Thursday', 'Friday'

Categorical variables in R - which one does R pick as reference?

Question: This question was migrated from Cross Validated because it can be answered on Stack Overflow (migrated 4 years ago). When R performs a regression using a categorical variable, it's effectively dummy coding. That is, one of the levels is omitted as the base or reference, and the regression formula includes dummies for all the other levels. But which one does R pick as the reference, and how can I influence this choice? Example data with four levels (from UCLA's IDRE): hsb2 <- read.csv("http://www

Matplotlib dot plot with two categorical variables

Question: I would like to produce a specific type of visualization, consisting of a rather simple dot plot but with a twist: both of the axes are categorical variables (i.e. ordinal or non-numerical values). This complicates matters instead of making them easier. To illustrate this question, I will be using a small example dataset that is a modification of seaborn.load_dataset("tips") and defined as such: import pandas from six import StringIO df = """total_bill | tip | sex | smoker | day | time |

How to find the correlation between continuous and categorical variables in R

Question: Sorry, I edited my question. In R, you can use the cor() function to find the correlation between continuous variables, but only with the Pearson and Spearman methods. Which function should I use to get the correlation between two categorical variables? And which function should I use to get the correlation between a categorical variable and a continuous variable? Thank you in advance. Source: https://stackoverflow.com/questions/41053431/how-to-find-the-correlation-between-continuous-and

Convert categorical data in data frame to weighted adjacency matrix

Question: I have the following data frame, call it DF, which consists of three vectors: "Chunk", "Name", and "Frequency". I need to turn it into a Name x Name adjacency matrix where names are considered adjacent when they reside in the same chunk. So, for example, in the first lines, Gretel and Friedrich are adjacent because they are both in Chunk2. And the weight of the relationship should be based on "Frequency", precisely the number of times they are co-present in the same chunk, so
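The adjacency rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the asker's data or an accepted answer: the rows are invented (only the Gretel and Friedrich pairing in Chunk2 comes from the question), the helper name `cooccurrence_matrix` is hypothetical, and the weight is taken to be the number of chunks two names share. The "Frequency" column is left out because the excerpt cuts off before saying how it enters the weight.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (chunk, name) rows; only Gretel and Friedrich sharing
# Chunk2 is taken from the question.
rows = [
    ("Chunk2", "Gretel"),
    ("Chunk2", "Friedrich"),
    ("Chunk3", "Gretel"),
    ("Chunk3", "Friedrich"),
    ("Chunk3", "Liesl"),
    ("Chunk4", "Liesl"),
]


def cooccurrence_matrix(rows):
    """Return (names, matrix) where matrix[i][j] counts shared chunks."""
    # Group the distinct names found in each chunk.
    names_by_chunk = defaultdict(set)
    for chunk, name in rows:
        names_by_chunk[chunk].add(name)

    # Fix a stable row/column order for the matrix.
    names = sorted({name for _, name in rows})
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]

    # Each pair of names inside a chunk adds 1 to both symmetric cells.
    for members in names_by_chunk.values():
        for a, b in combinations(sorted(members), 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return names, matrix


names, matrix = cooccurrence_matrix(rows)
print(names)
for row in matrix:
    print(row)
```

The diagonal stays at zero because a name is never paired with itself; a different weighting only needs a change to the increment inside the inner loop.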

How can I create a Partial Dependence plot for a categorical variable in R?

Question: I am working with the R package randomForest and have successfully made a random forest model and an importance plot. I am working with a dichotomous response and several categorical predictors. However, I can't figure out how to make partial dependence plots for my categorical variables. I have tried using the randomForest command partialPlot, but I get the following error: > partialPlot(rf.5, rf.train.1, religion) Error in is.finite(x) : default method not implemented for type 'list'. So

back fill missing data with a label for a window of a time

Question: I want to backfill each column based on time (1 day, 2 days) with a different label. Here is the code: from datetime import datetime, timedelta import pandas as pd import numpy as np import random np.random.seed(11) date_today = datetime.now() ndays = 15 df = pd.DataFrame({'date': [date_today + timedelta(days=x) for x in range(ndays)], 'test': pd.Series(np.random.randn(ndays)), 'test2':pd.Series(np.random.randn(ndays))}) df = df.set_index('date') df = df.mask(np.random.random(df.shape) < .7)

r data.table usage in function call

Question: I want to perform a data.table task over and over in a function call: reduce the number of levels for large categorical variables. My problem is similar to "Data.table and get() command (R)" or "pass column name in data.table using variable in R", but I can't get it to work. Without a function call this works just fine: # Load data.table require(data.table) # Some data set.seed(1) dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace = T)), weight = rnorm(n = 10e3, mean = 70, sd = 20)) #

Preprocess large datafile with categorical and continuous features

Question: First, thanks for reading, and thanks a lot if you can give me any clue to help me solve this. As I'm new to scikit-learn, don't hesitate to offer any advice that can help me improve the process and make it more professional. My goal is to classify data into two categories. I would like to find a solution that gives me the most precise result. At the moment, I'm still looking for the most suitable algorithm and data preprocessing. In my data I have 24 values: 13 are nominal, 6
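One common way to preprocess nominal and continuous features together in scikit-learn is a ColumnTransformer that one-hot encodes the former and standardizes the latter. The sketch below assumes pandas and scikit-learn are installed; the two-column frame and its column names (`colour`, `size_cm`) are hypothetical stand-ins, since the excerpt does not list the 24 real features, and nothing here claims to be the preprocessing the asker ended up using.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one nominal and one continuous feature.
data = pd.DataFrame(
    {
        "colour": ["red", "blue", "red", "green"],
        "size_cm": [10.0, 12.5, 9.0, 11.0],
    }
)

# Route each column to the transformer suited to its type.
preprocess = ColumnTransformer(
    [
        # Nominal values become one 0/1 column per category; categories
        # unseen during fit are encoded as all zeros at transform time.
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["colour"]),
        # Continuous values are rescaled to zero mean and unit variance.
        ("continuous", StandardScaler(), ["size_cm"]),
    ]
)

features = preprocess.fit_transform(data)
print(features.shape)
```

The fitted `preprocess` object can be placed in front of a classifier in a scikit-learn Pipeline so that the same encoding is applied at training and prediction time.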