I am using the removeSparseTerms method in R and it required a threshold value to be input. I also read that the higher the value, the more will be the number of terms retai
In the function removeSparseTerms(), the argument sparse = x means:
"remove all terms whose sparsity is greater than the threshold (x)".
e.g: removeSparseTerms(my_dtm, sparse = 0.90) means remove all terms in the corpus whose sparsity is greater than 90%.
For example, a term that appears say just 4 times in a corpus of say size 1000, will have a frequency of appearance of 0.004 =4/1000.
This term's sparsity will be (1000-4)/1000 = 1- 0.004 = 0.996 = 99.6%.
Therefore if sparsity threshold is set to sparse = 0.90, this term will be removed as its sparsity (0.996) is greater than the upper bound sparsity (0.90).
However, if sparsity threshold is set to sparse = 0.999, this term will not be removed as its sparsity (0.996) is lower than the upper bound sparsity (0.999).