I have a dataframe with this type of data (too many columns):
col1 int64
col2 int64
col3 category
col4 category
col5 categ
This works for me:
pandas.factorize( ['B', 'C', 'D', 'B'] )[0]
Output:
[0, 1, 2, 0]
You can do it less code like below :
f = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'),'col3':list('ababb')})
f['col1'] =f['col1'].astype('category').cat.codes
f['col2'] =f['col2'].astype('category').cat.codes
f['col3'] =f['col3'].astype('category').cat.codes
f
What I do is, I replace
values.
Like this-
df['col'].replace(to_replace=['category_1', 'category_2', 'category_3'], value=[1, 2, 3], inplace=True)
In this way, if the col
column has categorical values, they get replaced by the numerical values.
If your concern was only that you making a extra column and deleting it later, just dun use a new column at the first place.
dataframe = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'), 'col3':list('ababb')})
dataframe.col3 = pd.Categorical.from_array(dataframe.col3).codes
You are done. Now as Categorical.from_array
is deprecated, use Categorical
directly
dataframe.col3 = pd.Categorical(dataframe.col3).codes
If you also need the mapping back from index to label, there is even better way for the same
dataframe.col3, mapping_index = pd.Series(dataframe.col3).factorize()
check below
print(dataframe)
print(mapping_index.get_loc("c"))