Closest equivalent of a factor variable in Python Pandas

我在风中等你 2020-12-13 18:33

What is the closest equivalent to an R Factor variable in Python pandas?

4 answers
  • 2020-12-13 18:52

    If you're looking to do modeling, the patsy library has a lot of support for factors. I will admit to having struggled with this myself. I found these slides helpful. I wish I could give a better example, but this is as far as I've gotten.

  • 2020-12-13 18:55

    This question seems to be from a year back, but since it is still open, here's an update: pandas has introduced a categorical dtype, and it operates very similarly to factors in R. Please see this link for more information:

    http://pandas-docs.github.io/pandas-docs-travis/categorical.html

    Reproducing a snippet from the link above showing how to create a "factor" variable in pandas.

    In [1]: import pandas as pd
    
    In [2]: s = pd.Series(["a","b","c","a"], dtype="category")
    
    In [3]: s
    Out[3]: 
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): [a < b < c]
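
    As a rough sketch of how the categorical dtype lines up with R's factor: the categories correspond to `levels()`, and the integer codes are the underlying representation (0-based in pandas, 1-based in R). The sample values below are made up for illustration. Note that in recent pandas versions a categorical created with `dtype="category"` is unordered unless you ask for an ordering explicitly.

```python
import pandas as pd

# Hypothetical sample data, not taken from the original answer.
s = pd.Series(["low", "high", "medium", "low"], dtype="category")

# Categories play the role of R's levels(); by default they are sorted.
levels = list(s.cat.categories)
# Codes are the integer positions of each value within the categories.
codes = s.cat.codes.tolist()

# An explicit level order, similar in spirit to
# factor(x, levels = c("low", "medium", "high"), ordered = TRUE) in R.
sizes = pd.Categorical(
    ["low", "high", "medium", "low"],
    categories=["low", "medium", "high"],
    ordered=True,
)
ordered_codes = sizes.codes.tolist()

print(levels, codes, ordered_codes)
```

    With an ordered categorical, comparisons and `min()`/`max()` follow the declared level order rather than alphabetical order.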
    
  • 2020-12-13 18:55

    If you're looking to map a categorical variable to a number as R does, pandas provides a function that does just that, pandas.factorize: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html

    import pandas as pd
    
    df = pd.read_csv('path_to_your_file')
    df['new_factor'], _ = pd.factorize(df['old_categorical'], sort=True)
    
    

    This function returns both the integer codes and an array of the unique values. If you only need the codes for a column assignment, discard the latter, as above.
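
    To make the two return values concrete, here is a small self-contained sketch on a made-up column (the data is hypothetical). It also shows that, by default, missing values are not given a category of their own but receive the sentinel code -1.

```python
import numpy as np
import pandas as pd

# Hypothetical sample column containing one missing value.
values = pd.Series(["b", "a", np.nan, "b", "c"])

# codes: one integer per row; uniques: the distinct non-missing values.
codes, uniques = pd.factorize(values, sort=True)

codes_list = codes.tolist()
uniques_list = list(uniques)

# Codes index into uniques, so the original labels can be recovered;
# the -1 sentinel marks rows that were missing.
recovered = [uniques[c] if c >= 0 else None for c in codes]

print(codes_list, uniques_list, recovered)
```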

    If you want a homegrown solution, you can use a combination of a set and a dictionary within a function. This method is a bit easier to apply over multiple columns, but note that None, NaN, etc. will be included as categories with this method, and that the codes follow the set's arbitrary iteration order:

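    def factor(var):
        # Map each distinct value to an integer code (set order is arbitrary).
        codes = {value: code for code, value in enumerate(set(var))}
        return [codes[x] for x in var]
    
    
    # Call factor on the whole column; Series.apply would call it once per element.
    df['new_factor1'] = factor(df['old_categorical1'])
    # DataFrame.apply passes each column to factor in turn.
    df[['new_factor2', 'new_factor3']] = df[['old_categorical2', 'old_categorical3']].apply(factor)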
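
    If reproducible codes matter, one alternative (a sketch, not part of the original answer) is to loop over the columns with `pd.factorize(..., sort=True)`, so the codes follow sorted order instead of set iteration order. The frame and column names below are placeholders.

```python
import pandas as pd

# Hypothetical frame; the column names are placeholders.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": ["S", "L", "M", "S"],
})

# factorize returns (codes, uniques); keep only the codes for each column.
for col in ["color", "size"]:
    df[f"{col}_code"] = pd.factorize(df[col], sort=True)[0]

print(df)
```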
    
  • 2020-12-13 19:04
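    import numpy as np
    import matplotlib.pyplot as plt
    
    # C: NumPy array containing category data
    # V: NumPy array containing numerical data (same length as C)
    
    # Group the numerical values by category.
    mydict = {h: V[C == h] for h in np.unique(C)}
    
    # Draw one box per category.
    plt.boxplot(list(mydict.values()), labels=list(mydict.keys()))
    plt.show()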
    