one-hot-encoding

Why does Spark's OneHotEncoder drop the last category by default?

社会主义新天地 submitted on 2019-12-01 03:24:45
I would like to understand the rationale behind Spark's OneHotEncoder dropping the last category by default. For example:

```python
>>> fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
>>> ss = StringIndexer(inputCol="c", outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
|   x|  c|c_idx|
+----+---+-----+
| 1.0|  a|  0.0|
| 1.5|  a|  0.0|
|10.0|  b|  1.0|
| 3.2|  c|  2.0|
+----+---+-----+
```

By default, the OneHotEncoder will drop the last category:

```python
>>> oe = OneHotEncoder(inputCol="c_idx", outputCol="c_idx_vec")
>>> fe = oe.transform(ff)
```
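As context for the question: dropping one of k categories is the standard way to avoid the "dummy variable trap". The full k-column encoding always sums to 1 per row, so together with an intercept the columns are perfectly collinear; Spark exposes the choice through OneHotEncoder's `dropLast` parameter. A minimal NumPy sketch of the idea (the category indices mirror the `c_idx` column above):

```python
import numpy as np

# Indices for categories a=0, b=1, c=2, as in the c_idx column above.
idx = np.array([0, 0, 1, 2])
k = 3

# Full one-hot: k columns. Every row sums to 1, so with an intercept
# the columns are perfectly collinear.
full = np.eye(k)[idx]

# Spark-style "dropLast" encoding: k-1 columns; the last category
# (c_idx == 2.0) is represented by an all-zeros row.
dropped = full[:, :-1]
print(dropped)
```

The last category is still fully recoverable: it is the row whose k-1 dummies are all zero.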

OneHotEncoding Mapping

半城伤御伤魂 submitted on 2019-12-01 01:14:47
To discretize categorical features I'm using a LabelEncoder and OneHotEncoder. I know that LabelEncoder maps data alphabetically, but how does OneHotEncoder map data? I have a pandas dataframe, dataFeat, with 5 different columns and 4 possible labels, like below.

```python
dataFeat = data[['Feat1', 'Feat2', 'Feat3', 'Feat4', 'Feat5']]
```

```
Feat1 Feat2 Feat3 Feat4 Feat5
    A     B     A     A     A
    B     B     C     C     C
    D     D     A     A     B
    C     C     A     A     A
```

I apply a LabelEncoder like this:

```python
le = preprocessing.LabelEncoder()
intIndexed = dataFeat.apply(le.fit_transform)
```

This is how the labels are encoded by the LabelEncoder:

```
Label  LabelEncoded
    A             0
    B             1
    C             2
    D             3
```
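As a hedged answer sketch (assuming scikit-learn ≥ 0.20): OneHotEncoder orders its output columns by the sorted distinct values of each input column, so after a LabelEncoder the dummies come out in the same alphabetical order A, B, C, D. The fitted `categories_` attribute makes the mapping explicit:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One label-encoded feature column (A=0, B=1, C=2, D=3).
labels = np.array([[0], [1], [3], [2]])

enc = OneHotEncoder()
onehot = enc.fit_transform(labels).toarray()

# categories_ lists, per input column, which value each output
# dummy column stands for (in sorted order).
print(enc.categories_)
print(onehot)
```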

Convert a 2d matrix to a 3d one hot matrix numpy

不羁的心 submitted on 2019-11-29 02:29:40
Question: I have a np matrix and I want to convert it to a 3d array with one-hot encoding of the elements as the third dimension. Is there a way to do this without looping over each row? E.g.

```python
a = [[1, 3],
     [2, 4]]
```

should be made into

```python
b = [[1, 0, 0, 0],
     [0, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 1]]
```

Answer 1:

Approach #1: Here's a cheeky one-liner that abuses broadcasted comparison:

```python
(np.arange(a.max()) == a[..., None] - 1).astype(int)
```

Sample run:

```python
In [120]: a
Out[120]:
array([[1, 7, 5, 3],
       [2, 4, 1, 4]])

In [121]: (np.arange(a.max()) == a[...
```
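Since the sample run above is cut off, here is a self-contained version of the same trick on the question's own 2×2 input (a sketch; column j corresponds to value j+1 because the values are 1-based):

```python
import numpy as np

a = np.array([[1, 3],
              [2, 4]])

# Broadcasted comparison: each element (pushed onto a new trailing
# axis) is compared against the whole range of possible values.
# Subtracting 1 maps the 1-based values onto 0-based column indices.
b = (np.arange(a.max()) == a[..., None] - 1).astype(int)
print(b)
```

The result has shape (2, 2, 4): the one-hot vectors sit along the third dimension, one per element of `a`.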

Running get_dummies on several DataFrame columns?

非 Y 不嫁゛ submitted on 2019-11-28 22:41:59
How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?

Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series; see below for the workaround):

```python
In [1]: df = DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
   ...:                 'C': [1, 2, 3]})

In [2]: df
Out[2]:
   A  B  C
0  a  c  1
1  b  c  2
2  a  b  3

In [3]: pd.get_dummies(df)
Out[3]:
   C  A_a  A_b  B_b  B_c
0  1    1    0    0    1
1  2    0    1    0    1
2  3    1    0    1    0
```

Workaround for pandas < 0.15.0: You can do it for each column
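The workaround is cut off above; it presumably runs get_dummies column by column. A hedged sketch of that per-column approach (it also runs on modern pandas):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'],
                   'B': ['c', 'c', 'b'],
                   'C': [1, 2, 3]})

# Pre-0.15 workaround: call get_dummies once per categorical column,
# prefix the dummies with the source column name, then concatenate
# the pieces side by side with the untouched numeric column.
dummies = [pd.get_dummies(df[col], prefix=col) for col in ['A', 'B']]
out = pd.concat([df[['C']]] + dummies, axis=1)
print(out)
```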

One Hot Encoding using numpy [duplicate]

 ̄綄美尐妖づ submitted on 2019-11-28 18:41:23
This question already has an answer here: Convert array of indices to 1-hot encoded numpy array (17 answers)

If the input is zero I want to make an array which looks like this:

```python
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

and if the input is 5:

```python
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
```

For the above I wrote:

```python
np.put(np.zeros(10), 5, 1)
```

but it did not work. Is there any way in which this can be implemented in one line?

Usually, when you want to get a one-hot encoding for classification in machine learning, you have an array of indices:

```python
import numpy as np
nb_classes = 6
targets = np.array([[2, 3, 4, 0]]).reshape(-1)
one_hot_targets = np.eye
```
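The answer's last line is cut off; it is presumably the standard identity-matrix indexing trick, sketched here in full. (As an aside, `np.put` in the question modifies its array argument in place and returns None, which is why that one-liner appeared not to work.)

```python
import numpy as np

nb_classes = 6
targets = np.array([2, 3, 4, 0])

# Row i of the identity matrix is the one-hot vector for class i,
# so fancy indexing with the target indices builds the whole encoding.
one_hot_targets = np.eye(nb_classes)[targets]
print(one_hot_targets)
```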

Converting a Pandas Dataframe column into one hot labels

风格不统一 submitted on 2019-11-28 08:09:56
Question: I have a pandas dataframe similar to this:

```
  Col1 ABC
0  XYZ   A
1  XYZ   B
2  XYZ   C
```

By using the pandas get_dummies() function on column ABC, I can get this:

```
  Col1  A  B  C
0  XYZ  1  0  0
1  XYZ  0  1  0
2  XYZ  0  0  1
```

While I need something like this, where the ABC column has a list / array datatype:

```
  Col1        ABC
0  XYZ  [1, 0, 0]
1  XYZ  [0, 1, 0]
2  XYZ  [0, 0, 1]
```

I tried using the get_dummies function and then combining all the columns into the column which I wanted. I found a lot of answers explaining how to combine multiple
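The combining step the question is reaching for can be done in one pass: expand to dummies, then collapse each row back into a list-valued cell. A minimal sketch (the `.astype(int)` guards against newer pandas returning boolean dummies):

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['XYZ', 'XYZ', 'XYZ'],
                   'ABC': ['A', 'B', 'C']})

# Expand ABC into dummy columns, then store each dummy row as a
# single Python list in the original column.
df['ABC'] = pd.get_dummies(df['ABC']).astype(int).values.tolist()
print(df)
```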

One hot encoding of string categorical features

自作多情 submitted on 2019-11-28 04:31:51
I'm trying to perform a one-hot encoding of a trivial dataset:

```python
data = [['a', 'dog', 'red'],
        ['b', 'cat', 'green']]
```

What's the best way to preprocess this data using scikit-learn? On first instinct, you'd look towards scikit-learn's OneHotEncoder. But the one-hot encoder doesn't support strings as features; it only discretizes integers. So then you would use a LabelEncoder, which would encode the strings into integers. But then you have to apply the label encoder to each of the columns and store each one of these label encoders (as well as the columns they were applied on). And this feels
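For what it's worth, the limitation described here was lifted in scikit-learn 0.20: OneHotEncoder accepts string features directly, so the LabelEncoder detour is no longer needed. A sketch assuming that version or later:

```python
from sklearn.preprocessing import OneHotEncoder

data = [['a', 'dog', 'red'],
        ['b', 'cat', 'green']]

# Each input column gets its own block of dummy columns, ordered by
# the sorted distinct values of that column (see enc.categories_).
enc = OneHotEncoder()
encoded = enc.fit_transform(data).toarray()
print(enc.categories_)
print(encoded)
```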

Can sklearn random forest directly handle categorical features?

徘徊边缘 submitted on 2019-11-28 03:55:12
Say I have a categorical feature, color, which takes the values ['red', 'blue', 'green', 'orange'], and I want to use it to predict something in a random forest. If I one-hot encode it (i.e. I change it to four dummy variables), how do I tell sklearn that the four dummy variables are really one variable? Specifically, when sklearn is randomly selecting features to use at different nodes, it should either include the red, blue, green and orange dummies together, or it shouldn't include any of them. I've heard that there's no way to do this, but I'd imagine there must be a way to deal with
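The premise is correct: sklearn's trees split on one column at a time, so the four dummies are treated as unrelated features and there is no built-in way to force them to be selected as a group. A minimal sketch of the usual workaround anyway, one-hot encoding via pandas and fitting the forest on the dummies (the data here is made-up toy data, not from the question):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data: the target happens to be a deterministic function of color.
colors = ['red', 'blue', 'green', 'orange'] * 2
y = [1, 0, 1, 0] * 2

# One-hot encode; the forest sees four independent dummy columns and
# may pick any subset of them at each split.
X = pd.get_dummies(pd.Series(colors))
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)
print(clf.predict(X))
```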

Scikit Learn OneHotEncoder fit and transform Error: ValueError: X has different shape than during fitting

余生长醉 submitted on 2019-11-27 05:33:23
Below is my code. I know why the error occurs during transform: the feature list differs between fit and transform. How can I solve this? How can I get 0 for all the remaining features? After this I want to use the result for a partial fit of an SGD classifier.

```python
Jupyter QtConsole 4.3.1
Python 3.6.2 |Anaconda custom (64-bit)| (default, Sep 21 2017, 18:29:43)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
input_df = pd.DataFrame(dict(fruit=[
```
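A hedged sketch of one fix (assuming scikit-learn ≥ 0.20; the data is a hypothetical reconstruction, since the question's input_df is truncated): pass `handle_unknown='ignore'`, or declare the full category space via the `categories` parameter, so the output width is fixed at fit time and later batches simply get 0s for the categories they lack:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: fit sees only some categories, transform sees
# a category the encoder never saw.
fit_df = pd.DataFrame({'fruit': ['apple', 'banana']})
new_df = pd.DataFrame({'fruit': ['apple', 'cherry']})

# handle_unknown='ignore' keeps the fitted shape and encodes unseen
# categories as all zeros instead of raising ValueError.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(fit_df)
out = enc.transform(new_df).toarray()
print(out)
```

Because the output width stays constant across batches, the encoded matrices can be fed to `SGDClassifier.partial_fit` one batch at a time.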