Closest equivalent of a factor variable in Python Pandas

我在风中等你

What is the closest equivalent to an R Factor variable in Python pandas?

4 answers

闹比i

2020-12-13 18:52

If you're looking to do modeling, etc., the patsy library has lots of goodies for factors. I will admit to having struggled with this myself. I found these slides helpful. I wish I could give a better example, but this is as far as I've gotten myself.


栀梦

2020-12-13 18:55
This question seems to be from a year back, but since it is still open, here's an update. pandas has introduced a categorical dtype, and it operates very similarly to factors in R. Please see this link for more information:

http://pandas-docs.github.io/pandas-docs-travis/categorical.html

Reproducing a snippet from the link above showing how to create a "factor" variable in pandas.
```
In [1]: import pandas as pd

In [2]: s = pd.Series(["a","b","c","a"], dtype="category")

In [3]: s
Out[3]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
```
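As a rough sketch of how the categorical dtype lines up with R's factor (the variable names and sample values below are made up for illustration): the categories play the role of `levels()`, and the integer codes correspond to `as.integer()` on a factor, except that pandas codes start at 0 rather than 1.

```python
import pandas as pd

# Sample data invented for illustration.
s = pd.Series(["a", "b", "c", "a"], dtype="category")

levels = list(s.cat.categories)   # analogous to levels(f) in R
codes = s.cat.codes.tolist()      # 0-based integer codes (R's are 1-based)

# An explicitly ordered categorical, comparable to factor(..., ordered=TRUE).
sizes = pd.Series(
    pd.Categorical(["low", "high", "mid"],
                   categories=["low", "mid", "high"], ordered=True)
)
largest = sizes.max()             # min/max follow the declared category order

print(levels, codes, largest)
```

A plain `dtype="category"` is unordered by default, so order-aware operations such as `max()` need `ordered=True` as in the second series.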

醉酒成梦

2020-12-13 18:55
If you're looking to map a categorical variable to a number as R does, pandas provides a function that does just that: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html
```
import pandas as pd

df = pd.read_csv('path_to_your_file')
df['new_factor'], _ = pd.factorize(df['old_categorical'], sort=True)
```
This function returns two things: an array of integer codes (one per input value) and the unique values those codes refer to. If you're just doing variable assignment, you'll have to throw the latter away as above.
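To make the two return values concrete, here is a small sketch on invented sample data. It also shows how `pd.factorize` treats a missing value by default: it gets the sentinel code `-1` instead of a category of its own.

```python
import pandas as pd

# Sample values invented for illustration; None stands in for a missing value.
values = pd.Series(["b", "a", None, "b", "c"])

codes, uniques = pd.factorize(values, sort=True)

codes_list = codes.tolist()      # missing values get the sentinel code -1
uniques_list = list(uniques)     # sorted distinct values; codes index into this
print(codes_list, uniques_list)
```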

If you want a homegrown solution, you can use a combination of a set and a dictionary within a function. This method is a bit easier to apply over multiple columns, but note that None, NaN, etc. will be included as a category in this method, and that the codes follow the set's arbitrary iteration order rather than sorted order:
```
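def factor(var):
    # Map each distinct value (including None/NaN) to an integer code.
    mapping = {value: code for code, value in enumerate(set(var))}
    # Return a Series so the result stays aligned with the original index.
    return pd.Series([mapping[x] for x in var], index=var.index)


# Series.apply works element by element, so call factor on the column itself.
df['new_factor1'] = factor(df['old_categorical1'])
# DataFrame.apply passes each column to factor as a Series.
df[['new_factor2', 'new_factor3']] = df[['old_categorical2', 'old_categorical3']].apply(factor)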
```
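If the main appeal of the homegrown function is applying it across several columns, `pd.factorize` can be used the same way. The sketch below assumes a small made-up frame; the column names are placeholders, not anything from the question.

```python
import pandas as pd

# Hypothetical frame; the column names are placeholders.
df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size": ["S", "M", "S"],
})

cols = ["color", "size"]
# DataFrame.apply hands each column to the lambda as a Series;
# [0] keeps the integer codes and drops the array of unique values.
encoded = df[cols].apply(lambda col: pd.factorize(col, sort=True)[0])

print(encoded)
```

With `sort=True` the codes follow the sorted order of the values within each column, so they are reproducible between runs.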


無奈伤痛

2020-12-13 19:04

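import numpy as np
import matplotlib.pyplot as plt

# C: NumPy array containing category data
# V: NumPy array containing numerical data (same length as C)

# Collect the values of V that belong to each distinct category in C.
H = np.unique(C)
mydict = {}
for h in H:
    mydict[h] = V[C == h]

# One box per category, labeled with the category value.
plt.boxplot(list(mydict.values()), labels=list(mydict.keys()))
plt.show()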
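The same per-category grouping can also be sketched with a pandas `groupby`, assuming the two arrays live in a DataFrame. The column names `C` and `V` mirror the arrays above, and the sample values are invented for illustration.

```python
import pandas as pd

# Sample data invented for illustration.
df = pd.DataFrame({
    "C": ["x", "y", "x", "y", "x"],
    "V": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# One array of V per category in C, keyed by category.
groups = {name: grp.to_numpy() for name, grp in df.groupby("C")["V"]}

print(groups)
```

The resulting dictionary has the same shape as `mydict`, so its values and keys can be passed to a box plot in the same way.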
