Working with text classification and big sparse matrices in R

不羁的心 submitted on 2019-12-04 15:44:51

At what point did you hit the RAM limit?

quanteda is a good package for NLP on medium-sized datasets, but I also suggest trying my text2vec package. It is generally much more memory friendly and does not require loading all the raw text into RAM (for example, it can create a DTM for a Wikipedia dump on a 16 GB laptop).
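As a rough illustration of the text2vec workflow, here is a minimal sketch that builds a DTM from a tiny in-memory corpus. The documents and their ids are made up for the example; a real corpus would be much larger.

```r
library(text2vec)

# Toy corpus, made up for illustration
docs <- c(d1 = "Sparse matrices save memory",
          d2 = "Dense data frames waste memory on zeros",
          d3 = "Linear models handle sparse features well")

# itoken() creates an iterator over tokenized documents
it <- itoken(docs,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = names(docs),
             progressbar = FALSE)

# Build the vocabulary, then map documents onto it
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)

# Result is a sparse document-term matrix (documents x terms)
dtm <- create_dtm(it, vectorizer)
dim(dtm)
```

This toy example keeps everything in memory. For a corpus that does not fit in RAM, the package also provides file-based iterators such as `ifiles()`, which the sketch does not exercise.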

Second, I strongly recommend against converting the data into a data.frame. Try to work with sparseMatrix objects directly.
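To show what working with a sparse matrix directly can look like, here is a small sketch using the Matrix package. The indices and counts are invented for the example; the point is only that common summaries run on the sparse object without converting it to a dense matrix or data.frame.

```r
library(Matrix)

# Toy document-term counts in triplet form: (document, term, count).
# All values are made up for illustration.
doc_idx  <- c(1, 1, 2, 3, 3, 3)
term_idx <- c(1, 4, 2, 1, 3, 5)
counts   <- c(2, 1, 1, 3, 1, 1)

dtm <- sparseMatrix(i = doc_idx, j = term_idx, x = counts,
                    dims = c(3, 5),
                    dimnames = list(paste0("doc", 1:3),
                                    paste0("term", 1:5)))

# These operate on the sparse representation directly
term_totals <- colSums(dtm)   # total count per term
doc_lengths <- rowSums(dtm)   # total count per document
n_nonzero   <- nnzero(dtm)    # number of stored non-zero cells

# Subsetting also keeps the result sparse
sub <- dtm[c("doc1", "doc3"), , drop = FALSE]
```

Only the non-zero cells are stored, which is why this representation scales to document-term matrices that would be impractical as a data.frame.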

The following methods work well for text classification:

  1. Logistic regression with an L1 penalty (see the glmnet package)
  2. Linear SVM (see LiblineaR, but it is worth searching for alternatives)
  3. `xgboost` is also worth trying. I prefer linear models for this task, so try its linear booster.
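The first option can be sketched as follows: an L1-penalized logistic regression fitted with glmnet on a sparse matrix. The data here is synthetic and only stands in for a real DTM; the sizes, the seed, and the way the labels are generated are all assumptions made for the example.

```r
library(Matrix)
library(glmnet)

set.seed(42)
n_docs  <- 200
n_terms <- 50

# Synthetic sparse count features (about 10% non-zero); a stand-in for a real DTM
x <- rsparsematrix(n_docs, n_terms, density = 0.1,
                   rand.x = function(n) rpois(n, 1) + 1)

# Synthetic binary labels driven by the first two columns plus noise
y <- as.integer(x[, 1] - x[, 2] + rnorm(n_docs, sd = 0.5) > 0)

# alpha = 1 selects the lasso (L1) penalty; lambda is chosen by cross-validation.
# glmnet accepts the sparse matrix as-is, no conversion to dense is needed.
fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5)

# Predicted class probabilities at the cross-validated lambda
probs <- predict(fit, newx = x, s = "lambda.min", type = "response")

# Coefficients come back sparse; the L1 penalty sets many of them to zero
coefs <- coef(fit, s = "lambda.min")
```

On a real problem, `x` would be the DTM and the predictions would be made on held-out documents rather than on the training matrix.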