Subsetting based on number of observations in a factor variable

情话喂你 2021-01-25 15:34

How do you subset based on the number of observations in each level of a factor variable? I have a dataset with 1,000,000 rows and nearly 3000 levels, and I want to subset out

2 answers

没有蜡笔的小新 (OP)

2021-01-25 16:26
Use table() to count the rows per level, subset that table, and then match your data against the names of that subset. You will probably want to call droplevels() afterwards.

EDIT

Some sample data:
```
set.seed(1234)
data <- data.frame(factor = factor(sample(10000:12999, 1000000, 
  TRUE, prob=rexp(3000))))
```
Some categories have very few cases:
```
> min(table(data$factor))
[1] 1
```
Remove the records whose value of factor occurs in fewer than 100 rows:
```
tbl <- table(data$factor)
data <- droplevels(data[data$factor %in% names(tbl)[tbl >= 100],,drop=FALSE])
```
Check:
```
> min(table(data$factor))
[1] 100
```
Note that data and factor are not very good names, since they are also the names of built-in functions.
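For illustration, here is a minimal sketch of the same filter on a tiny toy data frame, using names that do not mask built-in functions. The names df, grp, and min_n are hypothetical and not from the original answer. It also shows an alternative that uses ave() to compute each row's group size directly, which avoids matching on level names; this is a swapped-in variant, not the method used above.

```r
# Toy data; `df` and `grp` are hypothetical names chosen so that
# they do not mask the built-in functions data() and factor().
df <- data.frame(grp = factor(c("a", "a", "a", "b", "c", "c")))
min_n <- 2  # assumed threshold: keep levels with at least this many rows

# Approach from the answer: count rows per level, then match on level names.
counts <- table(df$grp)
keep <- names(counts)[counts >= min_n]
kept <- droplevels(df[df$grp %in% keep, , drop = FALSE])

# Alternative sketch: ave() returns, for every row, the size of its group,
# so the rows can be filtered without going through the level names.
n_per_row <- ave(seq_along(df$grp), df$grp, FUN = length)
kept2 <- droplevels(df[n_per_row >= min_n, , drop = FALSE])

# Both results contain the same rows and the same remaining levels.
print(table(kept$grp))
print(identical(kept$grp, kept2$grp))
```

The drop = FALSE argument matters here because the data frame has a single column; without it, the subset would collapse to a bare factor.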