What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

旧时难觅i 2021-02-01 18:12

I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data dif

7 answers
  •  忘掉有多难
    2021-02-01 18:39

    IMO the opposite strategy, identifying the categorical columns rather than the continuous ones, is better, because whether a column is categorical depends on what the data is about. Technically, address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.

    For survey data, one idea is to look for Likert scales, i.e. columns with 5-8 distinct values that are either strings (which would probably need hardcoded, and translated, levels to look for, such as "good", "bad", ".agree.", "very .*", ...) or int values in the 0-8 range, plus NA.
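A minimal sketch of the Likert-scale idea above, assuming pandas is available. The function name `looks_like_likert`, the pattern list, and the rule that one matching label is enough are illustrative assumptions, not part of any library; a real tool would need a curated and translated list of levels.

```python
import re

import pandas as pd

# Hypothetical label fragments; real survey data needs a curated, translated list.
LIKERT_PATTERNS = [re.compile(p) for p in (r"good", r"bad", r"agree", r"^very ")]


def looks_like_likert(series: pd.Series, min_levels: int = 5, max_levels: int = 8) -> bool:
    """Guess whether a column holds Likert-style answers (NA values are ignored)."""
    values = series.dropna()
    if values.empty or pd.api.types.is_bool_dtype(values):
        return False

    if pd.api.types.is_numeric_dtype(values):
        # Integer-valued answers in the 0-8 range (floats appear when NA is present).
        is_integral = bool((values % 1 == 0).all())
        in_range = bool(values.between(0, 8).all())
        return is_integral and in_range

    if values.map(lambda v: isinstance(v, str)).all():
        levels = values.str.strip().str.lower().unique()
        if not (min_levels <= len(levels) <= max_levels):
            return False
        # Assumption: one recognisable label is enough to flag the column.
        return any(p.search(level) for level in levels for p in LIKERT_PATTERNS)

    return False


df = pd.DataFrame({
    "satisfaction": ["very good", "good", "neutral", "bad", "very bad", None],
    "score": [0, 3, 5, 8, None, 3],
    "price": [1.5, 20.0, 300.25, 4.0, 7.75, 9.9],
    "city": ["Paris", "Oslo", "Rome", "Lima", "Cairo", "Oslo"],
})
likert_columns = [name for name in df.columns if looks_like_likert(df[name])]
```

Substring patterns like these produce false positives (a city name containing "bad", for instance), so the result is better treated as a suggestion to review than as a final decision.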

    Countries and such things might also be identifiable...

    Age groups (".-.") might also work.
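The age-group idea could be sketched in the same spirit. The regular expression below is a stricter stand-in for the loose ".-." pattern mentioned above: it accepts labels such as "18-24" and, as an added assumption, open-ended ones such as "65+". The function name `looks_like_age_groups` and the 90% threshold are hypothetical choices.

```python
import re

import pandas as pd

# Matches "18-24" style ranges, plus open-ended "65+" labels (an added assumption).
AGE_GROUP = re.compile(r"^\s*\d{1,3}\s*-\s*\d{1,3}\s*$|^\s*\d{1,3}\s*\+\s*$")


def looks_like_age_groups(series: pd.Series, min_share: float = 0.9) -> bool:
    """Guess whether most non-NA values look like age-range labels."""
    values = series.dropna().astype(str)
    if values.empty:
        return False
    # Share of values that match the age-range pattern.
    share = values.map(lambda v: bool(AGE_GROUP.match(v))).mean()
    return bool(share >= min_share)


ages = pd.Series(["18-24", "25-34", "35-44", "65+", None])
dates = pd.Series(["2021-02-01", "2021-02-02"])
```

Here `looks_like_age_groups(ages)` flags the column, while the date strings are rejected because a four-digit year does not fit the `\d{1,3}` part of the pattern.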
