Vowpal Wabbit how to represent categorical features

慢半拍i 2020-12-08 12:14

I have the following data with all categorical variables:

    class  education    income    social_standing
    1       basic       low       good
    0              


        
1 answer
  • 2020-12-08 12:40

    Yes, you are correct.

    This representation will work with vowpal wabbit, but depending on the data it may not be optimal.

    To represent non-ordered, categorical variables (with discrete values), the standard vowpal wabbit trick is to use logical/boolean values for each possible (name, value) combination (e.g. person_is_good, color_blue, color_red). The reason this works is that vw implicitly assumes a value of 1 wherever a value is missing. There's no practical difference between color_red, color=red, color_is_red, or even (color,red) and color_red:1 except hash locations in memory. The only characters you cannot use in a variable name are the special separators (: and |) and white-space.

    Terminology note: this trick of converting each (feature + value) pair into a separate feature is sometimes called "One Hot Encoding".
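
    As a minimal sketch of this trick, the Python snippet below turns one row of categorical values into a vw input line. The function name one_hot_vw_line is hypothetical, the name=value spelling is just one of the equivalent spellings listed above, and the person name-space is borrowed from the example later in this answer.

```python
def one_hot_vw_line(label, row, namespace="person"):
    """Build one vw input line from a dict of categorical values.

    Each (name, value) pair becomes a single feature called "name=value".
    No ":<number>" suffix is written because vw assumes a value of 1.
    """
    features = []
    for name, value in row.items():
        feature = f"{name}={value}"
        # ':' and '|' and white-space are the separators named in the answer,
        # so they must not appear inside a feature name.
        if any(ch in ":|" or ch.isspace() for ch in feature):
            raise ValueError(f"illegal character in feature name: {feature!r}")
        features.append(feature)
    return f"{label} |{namespace} " + " ".join(features)


row = {"education": "basic", "income": "low", "social_standing": "good"}
print(one_hot_vw_line(1, row))
```

    For the first row of the question's data this prints 1 |person education=basic income=low social_standing=good.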

    But in this case the variable-values may not be "strictly categorical". They may be:

    • Strictly ordered, e.g. (low < basic < high < v_high)
    • Presumably monotonically related to the label you're trying to predict

    so by making them "strict categorical" (my term for a variable with a discrete range which doesn't have the two properties above) you may be losing some information that may help learning.

    In your particular case, you may get better results by converting the values to numeric, e.g. (1, 2, 3, 4) for education, i.e. you could use something like:

    1 |person education:2 income:1 social_standing:2
    0 |person education:1 income:2 social_standing:3
    1 |person education:3 income:1 social_standing:1
    0 |person education:4 income:2 social_standing:2
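
    Lines like the ones above could be generated with a small helper. This is only a sketch: ordinal_vw_line and EDUCATION_RANK are hypothetical names, and the rank table assumes the ordering low < basic < high < v_high mentioned earlier. Rank tables for the other variables would have to be supplied in the same way.

```python
# Rank order assumed from the ordering example: low < basic < high < v_high.
EDUCATION_RANK = {"low": 1, "basic": 2, "high": 3, "v_high": 4}


def ordinal_vw_line(label, row, scales, namespace="person"):
    """Encode each value as name:<rank>, looking ranks up in `scales`.

    `scales` maps a variable name to its own {value: rank} table, so a
    value that is missing from the table raises KeyError instead of
    silently getting an arbitrary number.
    """
    features = [f"{name}:{scales[name][value]}" for name, value in row.items()]
    return f"{label} |{namespace} " + " ".join(features)


print(ordinal_vw_line(1, {"education": "basic"}, {"education": EDUCATION_RANK}))
```

    The numbers only need to preserve the order; the exact spacing between them is a modelling choice.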
    

    That said, the training set in the question should still work fine: even when you convert all your discrete variables into boolean variables as you did, vw should discover both the ordering and the monotonicity with the label from the data itself, as long as the two properties above hold and there is enough data to deduce them.

    Here's the short cheat-sheet for encoding variables in vowpal wabbit:

    Variable type       How to encode                readable example
    -------------       -------------                ----------------
    boolean             only encode the true case    is_alive
    categorical         append value to name         color=green
    ordinal+monotonic   :approx_value                education:2
    numeric             :actual_value                height:1.85
    

    Final notes:

    • In vw all variables are numeric. The encoding tricks are just practical ways to make things appear as categorical or boolean. Boolean variables are simply numeric 0 or 1; categorical variables can be encoded as boolean: name+value:1.
    • Any variable whose value is not monotonic with the label may be less useful when numerically encoded.
    • Any variable that is not linearly related to the label may benefit from a non-linear transformation before training.
    • Any variable with a zero value will not make a difference to the model (exception: when the --initial_weight <value> option is used), so it can be dropped from the training set.
    • When parsing a feature, only : is considered a special separator (between the variable name and its numeric value). Anything else is considered part of the name, and the whole name string is hashed to a location in memory. A missing :<value> part implies :1.
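
    The last two notes can be illustrated with a few lines of Python. This is not vw's actual parser, only a sketch of the rules as stated above, and parse_feature and drop_zero_features are hypothetical names.

```python
def parse_feature(token):
    """Split a "name[:value]" token into (name, value).

    Only ':' separates the name from its value; a missing value means 1.
    """
    name, sep, raw = token.partition(":")
    value = float(raw) if sep else 1.0
    return name, value


def drop_zero_features(tokens):
    """Keep only features with a non-zero value.

    Zero-valued features do not affect the model (the --initial_weight
    exception noted above aside), so they can be left out of the input.
    """
    parsed = [parse_feature(token) for token in tokens]
    return [(name, value) for name, value in parsed if value != 0.0]


print(parse_feature("color=green"))
print(parse_feature("height:1.85"))
print(drop_zero_features(["is_alive", "height:1.85", "income:0"]))
```

    Note that color=green comes back whole with a value of 1.0, because = is not a separator and the full string is the feature name.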

    Edit: what about name-spaces?

    Name spaces are prepended to feature names with a special-char separator so they map identical features to different hash locations. Example:

    |E low |I low
    

    is essentially equivalent to the flat example without name-spaces:

    |  E^low:1 I^low:1
    

    The main use of name-spaces is to easily redefine all members of a name-space to something else, ignore a full name space of features, cross features of a name space with another etc. (see -q, --cubic, --redefine, --ignore, --keep options).
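
    The equivalence between the name-spaced and flat forms shown above can be sketched at the string level as follows. flatten_namespaces is a hypothetical helper, it uses ^ as the separator only because the flat example above does, and it handles just the plain forms shown here. It says nothing about how vw computes the hash locations internally.

```python
def flatten_namespaces(line, sep="^"):
    """Rewrite "|E low |I low" style input as flat "E^low:1 I^low:1" features."""
    label, *segments = line.split("|")
    flat = []
    for segment in segments:
        tokens = segment.split()
        # A name-space name must directly follow '|'. If white-space comes
        # first, the segment has no name-space and names are left unprefixed.
        if segment[:1].isspace() or not tokens:
            namespace, features = "", tokens
        else:
            namespace, features = tokens[0], tokens[1:]
        prefix = f"{namespace}{sep}" if namespace else ""
        for feature in features:
            name, colon, value = feature.partition(":")
            # A missing ":<value>" part implies a value of 1.
            flat.append(f"{prefix}{name}:{value if colon else 1}")
    return f"{label}| " + " ".join(flat)


print(flatten_namespaces("|E low |I low"))
```

    Without the prefixes, the two low features would be the same string and so would land on the same hash location, which is the collision that name-spaces avoid.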
