categorical-data | 易学教程

Pandas: Convert lists within a single column to multiple columns

Question: I have a dataframe that includes columns with multiple attributes separated by commas:

    df = pd.DataFrame({'id': [1,2,3], 'labels' : ["a,b,c", "c,a", "d,a,b"]})

       id labels
    0   1  a,b,c
    1   2    c,a
    2   3  d,a,b

(I know this isn't an ideal situation, but the data originates from an external source.) I want to turn the multi-attribute columns into multiple columns, one for each label, so that I can treat them as categorical variables. Desired output:

       id     a      b      c      d
    0   1  True   True   True  False
    1   2  True  False   True
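One way this could be approached, offered as a minimal sketch rather than the accepted answer: pandas provides `Series.str.get_dummies`, which splits each string on a separator and returns one 0/1 indicator column per distinct label. The names `indicators` and `result` are introduced here for illustration only.

```python
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({'id': [1, 2, 3], 'labels': ["a,b,c", "c,a", "d,a,b"]})

# Split each string on ',' and build one indicator column per label.
# get_dummies returns integers (0/1), so cast to bool to match the desired output.
indicators = df['labels'].str.get_dummies(sep=',').astype(bool)

# Put the id column back next to the indicator columns.
result = pd.concat([df[['id']], indicators], axis=1)
print(result)
```

The indicator columns come back sorted by label name, which is why `d` ends up last even though it appears first in the third row.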

“Automatically” calculate linear combination of parameter estimates with PROC GLM

Background: I have a categorical variable, X, with four levels that I fit as separate dummy variables. Thus, there are three total dummy variables representing x=1, x=2, x=3 (x=0 is baseline).

Problem/issue: I want to be able to calculate the value of a linear combination (i.e. using SAS as a calculator) of these dummy variables. For example, 2*B1 + 2*B2 + B3. In Stata, this can be done using the lincom command, which uses the stored beta estimates to calculate linear combinations of the parameters. In SAS in a procedure such as PROC GLM, I think I should use the ESTIMATE statement, but I'm

SQL query to get the subtotal of some rows

Question: What would be the SQL query script if I want to get the total items and total revenue for each manager including his team? Suppose I have this table items_revenue with columns:

    | id | is_manager | manager_id | name     | no_of_items | revenue |
    | 1  | 1          | 0          | Manager1 | 621         | 833     |
    | 2  | 1          | 0          | Manager2 | 458         | 627     |
    | 3  | 1          | 0          | Manager3 | 872         | 1027    |
    ...
    | 8  | 0          | 1          | Member1  | 1258        | 1582    |
    | 9  | 0          | 2          | Member2  | 5340        | 8827    |
    | 10 | 0          | 3          | Member3  | 3259        | 5124    |

All the managers and their
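A possible shape for such a query, sketched with Python's built-in `sqlite3` so it can be run as-is. It assumes that "including his team" means adding each manager's own row to the rows whose `manager_id` points at that manager, and it loads only the six rows visible in the excerpt (the rows elided by `...` are not reconstructed). The aliases `m` and `t` and the output column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items_revenue ("
    "id INTEGER, is_manager INTEGER, manager_id INTEGER, "
    "name TEXT, no_of_items INTEGER, revenue INTEGER)"
)
# Only the rows shown in the question; the elided rows are left out.
conn.executemany(
    "INSERT INTO items_revenue VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 0, "Manager1", 621, 833),
        (2, 1, 0, "Manager2", 458, 627),
        (3, 1, 0, "Manager3", 872, 1027),
        (8, 0, 1, "Member1", 1258, 1582),
        (9, 0, 2, "Member2", 5340, 8827),
        (10, 0, 3, "Member3", 3259, 5124),
    ],
)

# Self-join: m is the manager row, t is each team member reporting to m.
# LEFT JOIN plus COALESCE keeps managers who have no team members.
query = """
SELECT m.id,
       m.name,
       m.no_of_items + COALESCE(SUM(t.no_of_items), 0) AS total_items,
       m.revenue     + COALESCE(SUM(t.revenue), 0)     AS total_revenue
FROM items_revenue AS m
LEFT JOIN items_revenue AS t ON t.manager_id = m.id
WHERE m.is_manager = 1
GROUP BY m.id, m.name, m.no_of_items, m.revenue
ORDER BY m.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The self-join is a design choice for a two-level hierarchy only; deeper reporting chains would need a recursive query instead.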

Automatically compare nested models from mice's glm.mids

I have a multiply-imputed model from R's mice package in which there are lots of factor variables. For example:

    library(mice)
    library(Hmisc)
    # turn all the variables into factors
    fake = nhanes
    fake$age = as.factor(nhanes$age)
    fake$bmi = cut2(nhanes$bmi, g=3)
    fake$chl = cut2(nhanes$chl, g=3)
    head(fake)
      age         bmi hyp       chl
    1   1        <NA>  NA      <NA>
    2   2 [20.4,25.5)   1 [187,206)
    3   1        <NA>   1 [187,206)
    4   3        <NA>  NA      <NA>
    5   1 [20.4,25.5)   1 [113,187)
    6   3        <NA>  NA [113,187)
    imput = mice(nhanes)
    # big model
    fit1 = glm.mids((hyp==2) ~ age + bmi + chl, data=imput, family = binomial)

I want to test the significance of each

Preprocess large datafile with categorical and continuous features

First, thanks for reading, and thanks a lot for any clue that helps me solve this. As I'm new to Scikit-learn, don't hesitate to offer any advice that would improve the process and make it more professional. My goal is to classify data into two categories, and I would like to find the solution that gives the most precise result. At the moment, I'm still looking for the most suitable algorithm and data preprocessing. My data has 24 values: 13 are nominal, 6 are binarized and the others are continuous. Here is an example of a line: "RENAULT";"CLIO III";"CLIO III

Convert text to int64 categorical in Pandas

I have some artist names in data['artist'] that I would like to convert to a categorical column via:

    x = data['artist'].astype('category').cat.codes
    x.dtype

Returns:

    dtype('int32')

I am getting negative numbers which suggests some sort of overflow situation. So, I'd like to use np.int64 instead but I can't find documentation on how to accomplish this.

    x = data['artist'].astype('category').cat.codes.astype(np.int64)
    x.dtype

Gives dtype('int64') but it is clear that the int32 gets converted to int64 and so the negative value is still present

    x = data['artist'].astype('category').cat.codes.astype
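One possible explanation for the negative numbers, which may or may not be what is happening in the asker's data: pandas assigns the code -1 to missing values in a categorical column, so a NaN or None entry yields -1 regardless of the integer width. A minimal sketch with hypothetical artist names (not taken from the original data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data['artist'], with one missing value.
artists = pd.Series(['artist_a', None, 'artist_b', 'artist_a'])

codes = artists.astype('category').cat.codes
print(codes.tolist())   # the missing entry is encoded as -1
print(codes.dtype)      # pandas picks the smallest integer type that fits

# Widening the dtype changes the storage, not the values: -1 is still -1.
codes64 = codes.astype(np.int64)

# Rows whose code is -1 correspond to missing values in the source column.
missing_mask = codes == -1
```

If that is the cause, filling or dropping the missing values before converting would be the place to look, rather than the integer type.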

Handling NULL values in Spark StringIndexer

I have a dataset with some categorical string columns and I want to represent them as double type. I used StringIndexer for this conversion and it works, but when I tried it on another dataset that has NULL values it threw a java.lang.NullPointerException and did not work. For better understanding, here is my code:

    for(col <- cols){
      out_name = col ++ "_"
      var indexer = new StringIndexer().setInputCol(col).setOutputCol(out_name)
      var indexed = indexer.fit(df).transform(df)
      df = (indexed.withColumn(col, indexed(out_name))).drop(out_name)
    }

So how can I solve this NULL data problem with

Lexical dispersion plot in seaborn

I am using the seaborn module to produce a plot similar to the example below.

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns

    location = "/global/scratch/umalmonj/WRF/juris/golden_hourly_manual_obs.csv"
    df = pd.read_csv(location,usecols= ["Year","Month","Day","Time","Weather"],parse_dates=[["Year","Month","Day","Time"]])

I have a df that looks like:

        Year_Month_Day_Time  Weather
    0   2010-01-01 00:00:00  NaN
    1   2010-01-01 01:00:00  NaN
    2   2010-01-01 02:00:00  NaN
    ..
    7   2010-01-01 07:00:00  Snow
    8   2010-01-01 08:00:00  Snow
    9   2010-01-01 09:00:00  Snow Showers
    ..
    18

How to plot parallel coordinates with multiple categorical variables in R

Question: I am having difficulty plotting a parallel coordinates plot using ggparcoord from the GGally package. As there are two categorical variables, what I want to show in the visualisation is like the image below. I've found that in ggparcoord, groupColumn accepts only a single variable to group (colour) by. I can use showPoints to mark the values on the axes, but I also need to vary the shape of these markers according to the categorical variables. Is there other

Matplotlib: how to plot a line with categorical data on the x-axis?

I am trying to plot a few lines (not a bar plot, as in this case). My y values are float, whereas x values are categorical data. How to do this in matplotlib?

My values:

    data1=[5.65,7.61,8.17,7.60,9.54]
    data2=[7.61,16.17,16.18,19.54,19.81]
    data3=[29.55,30.24,31.51,36.40,35.47]

My categories:

    x_axis=['A','B','C','D','E']

The code I am using, which does not give me what I want:

    import matplotlib.pyplot as plt
    fig=plt.figure() #Creates a new figure
    ax1=fig.add_subplot(111) #Plot with: 1 row, 1 column, first subplot.
    line1 = ax1.plot(str(x_axis), data1,'ko-',label='line1') #Plotting data1
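A sketch of one common approach, not necessarily the answer the asker received: `str(x_axis)` turns the whole list into a single string, so the categories never reach the axis. Plotting against integer positions and then labelling the ticks with the category names avoids that. The `positions` variable, the colours of the second and third lines, and the `Agg` backend (chosen so the snippet runs without a display) are assumptions added here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs headless
import matplotlib.pyplot as plt

# Values and categories copied from the question.
data1 = [5.65, 7.61, 8.17, 7.60, 9.54]
data2 = [7.61, 16.17, 16.18, 19.54, 19.81]
data3 = [29.55, 30.24, 31.51, 36.40, 35.47]
x_axis = ['A', 'B', 'C', 'D', 'E']

# One integer position per category: 0, 1, 2, 3, 4.
positions = list(range(len(x_axis)))

fig, ax1 = plt.subplots()
ax1.plot(positions, data1, 'ko-', label='line1')
ax1.plot(positions, data2, 'ro-', label='line2')
ax1.plot(positions, data3, 'bo-', label='line3')

# Show the category names instead of the numeric positions.
ax1.set_xticks(positions)
ax1.set_xticklabels(x_axis)
ax1.legend()
```

Calling `plt.show()` or `fig.savefig(...)` afterwards would display or save the figure, depending on the backend in use.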