categorical-data

Pandas: Convert lists within a single column to multiple columns

心已入冬 submitted on 2019-12-07 04:37:37
Question: I have a dataframe that includes columns with multiple attributes separated by commas:

    df = pd.DataFrame({'id': [1,2,3], 'labels' : ["a,b,c", "c,a", "d,a,b"]})

       id labels
    0   1  a,b,c
    1   2    c,a
    2   3  d,a,b

(I know this isn't an ideal situation, but the data originates from an external source.) I want to turn the multi-attribute columns into multiple columns, one for each label, so that I can treat them as categorical variables. Desired output:

       id     a      b     c      d
    0   1  True   True  True  False
    1   2  True  False  True
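One possible approach, sketched here rather than taken from the thread: pandas' `Series.str.get_dummies` splits each string on a separator and returns one 0/1 indicator column per distinct label. The names `indicators` and `result` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'labels': ["a,b,c", "c,a", "d,a,b"]})

# Split each string on commas and build one indicator column per label,
# then convert the 0/1 indicators to booleans.
indicators = df['labels'].str.get_dummies(sep=',').astype(bool)

# Attach the indicator columns to the id column.
result = df[['id']].join(indicators)
print(result)
```

The indicator columns come back in sorted label order (a, b, c, d), which matches the layout of the desired output in the question.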

“Automatically” calculate linear combination of parameter estimates with PROC GLM

孤者浪人 submitted on 2019-12-06 12:44:43
Background : I have a categorical variable, X , with four levels that I fit as separate dummy variables. Thus, there are three total dummy variables representing x=1, x=2, x=3 (x=0 is baseline). Problem/issue : I want to be able to calculate the value of a linear combination (i.e. using SAS as a calculator) of these dummy variables. For example, 2*B1 + 2*B2 + B3. In Stata, this can be done using the lincom command, which uses the stored beta estimates to calculate linear combinations of the parameters. In SAS in a procedure such as PROC GLM, I think I should use the ESTIMATE statement, but I'm

SQL query to get the subtotal of some rows

…衆ロ難τιáo~ submitted on 2019-12-06 12:23:00
Question: What would be the SQL query script if I want to get the total items and total revenue for each manager including his team? Suppose I have this table items_revenue with columns:

| id |is_manager|manager_id| name     |no_of_items| revenue |
| 1  | 1        | 0        | Manager1 | 621       | 833     |
| 2  | 1        | 0        | Manager2 | 458       | 627     |
| 3  | 1        | 0        | Manager3 | 872       | 1027    |
...
| 8  | 0        | 1        | Member1  | 1258      | 1582    |
| 9  | 0        | 2        | Member2  | 5340      | 8827    |
| 10 | 0        | 3        | Member3  | 3259      | 5124    |

All the managers and their
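The grouping logic can be illustrated in pandas instead of SQL (a swapped-in tool, not the query the question asks for). This is a minimal sketch that uses only the six rows visible above and assumes each member's `manager_id` points at the manager's `id`. The `team_id` column and the `totals` name are hypothetical.

```python
import numpy as np
import pandas as pd

# Only the six rows shown in the question; the rows elided by "..." are not reconstructed.
items_revenue = pd.DataFrame({
    'id': [1, 2, 3, 8, 9, 10],
    'is_manager': [1, 1, 1, 0, 0, 0],
    'manager_id': [0, 0, 0, 1, 2, 3],
    'name': ['Manager1', 'Manager2', 'Manager3', 'Member1', 'Member2', 'Member3'],
    'no_of_items': [621, 458, 872, 1258, 5340, 3259],
    'revenue': [833, 627, 1027, 1582, 8827, 5124],
})

# Hypothetical team key: a manager's own id, or the manager_id of a team member.
team_id = np.where(items_revenue['is_manager'] == 1,
                   items_revenue['id'],
                   items_revenue['manager_id'])

# Sum items and revenue over each manager together with his team.
totals = (items_revenue.assign(team_id=team_id)
          .groupby('team_id')[['no_of_items', 'revenue']]
          .sum())
print(totals)
```

In SQL the same idea would presumably be a `GROUP BY` on an expression such as `CASE WHEN is_manager = 1 THEN id ELSE manager_id END`, though the exact query depends on the parts of the question that are cut off.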

Automatically compare nested models from mice's glm.mids

你说的曾经没有我的故事 submitted on 2019-12-06 11:07:30
I have a multiply-imputed model from R's mice package in which there are lots of factor variables. For example:

    library(mice)
    library(Hmisc)
    # turn all the variables into factors
    fake = nhanes
    fake$age = as.factor(nhanes$age)
    fake$bmi = cut2(nhanes$bmi, g=3)
    fake$chl = cut2(nhanes$chl, g=3)
    head(fake)

      age         bmi hyp       chl
    1   1        <NA>  NA      <NA>
    2   2 [20.4,25.5)   1 [187,206)
    3   1        <NA>   1 [187,206)
    4   3        <NA>  NA      <NA>
    5   1 [20.4,25.5)   1 [113,187)
    6   3        <NA>  NA [113,187)

    imput = mice(nhanes)
    # big model
    fit1 = glm.mids((hyp==2) ~ age + bmi + chl, data=imput, family = binomial)

I want to test the significance of each

Preprocess large datafile with categorical and continuous features

强颜欢笑 submitted on 2019-12-06 07:28:54
First, thanks for reading, and thanks a lot if you can give any clue to help me solve this. As I'm new to Scikit-learn, don't hesitate to provide any advice that can help me improve the process and make it more professional. My goal is to classify data into two categories. I would like to find a solution that gives me the most precise result. At the moment, I'm still looking for the most suitable algorithm and data preprocessing. In my data I have 24 values: 13 are nominal, 6 are binarized and the others are continuous. Here is an example of a line "RENAULT";"CLIO III";"CLIO III

Convert text to int64 categorical in Pandas

廉价感情. submitted on 2019-12-06 06:06:25
I have some artist names in data['artist'] that I would like to convert to a categorical column via:

    x = data['artist'].astype('category').cat.codes
    x.dtype

Returns:

    dtype('int32')

I am getting negative numbers, which suggests some sort of overflow situation. So I'd like to use np.int64 instead, but I can't find documentation on how to accomplish this.

    x = data['artist'].astype('category').cat.codes.astype(np.int64)
    x.dtype

Gives dtype('int64'), but it is clear that the int32 gets converted to int64 and so the negative value is still present.

    x = data['artist'].astype('category').cat.codes.astype
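A likely explanation, assuming the column contains missing values: pandas encodes a missing entry in a categorical as code -1, so a negative code need not indicate overflow, and casting to int64 keeps it. The sketch below uses made-up artist names (not the asker's data) to show the behaviour.

```python
import pandas as pd

# Hypothetical artist names, including one missing value.
artist = pd.Series(['Adele', None, 'Bjork', 'Adele'])

codes = artist.astype('category').cat.codes
print(codes.tolist())  # the missing entry is coded as -1
print(codes.dtype)     # pandas picks a small integer dtype that fits the categories

# Widening the dtype keeps the -1 marker; it only changes the storage size.
codes64 = codes.astype('int64')

# Count how many entries were missing rather than assigned to a category.
n_missing = int((codes == -1).sum())
print(n_missing)
```

If the -1 values are unwanted, one option is to fill or drop the missing names before converting to a categorical.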

Handling NULL values in Spark StringIndexer

本秂侑毒 submitted on 2019-12-06 00:32:55
I have a dataset with some categorical string columns and I want to represent them as double type. I used StringIndexer for this conversion and it works, but when I tried it on another dataset that has NULL values it gave a java.lang.NullPointerException error and did not work. For better understanding, here is my code:

    for(col <- cols){
      out_name = col ++ "_"
      var indexer = new StringIndexer().setInputCol(col).setOutputCol(out_name)
      var indexed = indexer.fit(df).transform(df)
      df = (indexed.withColumn(col, indexed(out_name))).drop(out_name)
    }

So how can I solve this NULL data problem with

Lexical dispersion plot in seaborn

故事扮演 submitted on 2019-12-05 22:40:40
I am using the seaborn module to produce a plot similar to the example below.

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns

    location = "/global/scratch/umalmonj/WRF/juris/golden_hourly_manual_obs.csv"
    df = pd.read_csv(location,usecols= ["Year","Month","Day","Time","Weather"],parse_dates=[["Year","Month","Day","Time"]])

I have a df that looks like:

        Year_Month_Day_Time   Weather
    0   2010-01-01 00:00:00   NaN
    1   2010-01-01 01:00:00   NaN
    2   2010-01-01 02:00:00   NaN
    ..
    7   2010-01-01 07:00:00   Snow
    8   2010-01-01 08:00:00   Snow
    9   2010-01-01 09:00:00   Snow Showers
    ..
    18

How to plot parallel coordinates with multiple categorical variables in R

自作多情 submitted on 2019-12-05 19:13:43
Question: I am having difficulty plotting a parallel coordinates plot using ggparcoord from the GGally package. As there are two categorical variables, what I want to show in the visualisation is like the image below. I've found that in ggparcoord, groupColumn accepts only a single variable to group (colour) by, and surely I can use showPoints to mark the values on the axes, but I also need to vary the shape of these markers according to the categorical variables. Is there other

Matplotlib: how to plot a line with categorical data on the x-axis?

瘦欲@ submitted on 2019-12-05 16:44:25
I am trying to plot a few lines (not a bar plot, as in this case). My y values are float, whereas x values are categorical data. How to do this in matplotlib? My values:

    data1=[5.65,7.61,8.17,7.60,9.54]
    data2=[7.61,16.17,16.18,19.54,19.81]
    data3=[29.55,30.24,31.51,36.40,35.47]

My categories:

    x_axis=['A','B','C','D','E']

The code I am using, which does not give me what I want:

    import matplotlib.pyplot as plt
    fig=plt.figure() #Creates a new figure
    ax1=fig.add_subplot(111) #Plot with: 1 row, 1 column, first subplot.
    line1 = ax1.plot(str(x_axis), data1,'ko-',label='line1') #Plotting data1
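One thing that stands out in the code above is `str(x_axis)`, which turns the whole list into a single string instead of five categories. A common workaround, sketched here under the assumption that the categories should be evenly spaced, is to plot against integer positions and then relabel the ticks. The `positions` name, the line colours and the `Agg` backend choice are illustrative, not from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

data1 = [5.65, 7.61, 8.17, 7.60, 9.54]
data2 = [7.61, 16.17, 16.18, 19.54, 19.81]
data3 = [29.55, 30.24, 31.51, 36.40, 35.47]
x_axis = ['A', 'B', 'C', 'D', 'E']

# One integer position per category: 0, 1, 2, 3, 4.
positions = list(range(len(x_axis)))

fig, ax1 = plt.subplots()
ax1.plot(positions, data1, 'ko-', label='line1')
ax1.plot(positions, data2, 'ro-', label='line2')
ax1.plot(positions, data3, 'bo-', label='line3')

# Replace the numeric tick labels with the category names.
ax1.set_xticks(positions)
ax1.set_xticklabels(x_axis)
ax1.legend()
```

Matplotlib 2.1 and later also accept a list of strings directly as x values, so `ax1.plot(x_axis, data1, 'ko-')` may be enough on a recent installation.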