Problem with big data (?) during computation of sequence distances using TraMineR


不要未来只要你来 2020-12-09 22:21

I am trying to run an optimal matching analysis using TraMineR but it seems that I am encountering an issue with the size of the dataset. I have a big dataset of European co

2 answers

予麋鹿 (original poster)

2020-12-09 22:55
An easy solution that often works well is to analyze only a sample of your data. For instance,
```
# Draw 5000 row indices at random (without replacement) and keep those sequences
employdat.sts <- employdat.sts[sample(nrow(employdat.sts), 5000), ]
```
would extract a random sample of 5000 sequences. Exploring a sample of that size should be largely sufficient to find out the characteristics of your sequences, including their diversity.

To improve representativeness, you can also use stratified sampling (e.g., by first or last state, or by covariates available in your data set). Since you have the original data set at hand, you fully control the random sampling design.
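As a rough illustration of the stratified variant, here is a minimal base R sketch. The data frame `toy`, the column `first_state`, and the fraction `frac` are all hypothetical stand-ins, not part of the original data; with real data you would build the strata from your own covariate or from the first state of each sequence.

```r
# Toy stand-in for the real data: one row per sequence, plus a
# hypothetical stratification variable `first_state`.
set.seed(42)
toy <- data.frame(
  id = 1:1000,
  first_state = sample(c("employed", "unemployed", "inactive"), 1000,
                       replace = TRUE, prob = c(0.6, 0.25, 0.15))
)

frac <- 0.1  # assumed sampling fraction, applied within every stratum

# Group row indices by stratum, then draw a simple random sample
# (without replacement) inside each group.
rows_by_stratum <- split(seq_len(nrow(toy)), toy$first_state)
sampled_rows <- unlist(lapply(rows_by_stratum, function(idx) {
  n_take <- max(1, round(length(idx) * frac))
  idx[sample.int(length(idx), n_take)]
}), use.names = FALSE)

toy_sample <- toy[sampled_rows, ]
```

Assuming the rows of the state sequence object are in the same order as the rows of the data frame used to build the strata, the same `sampled_rows` index could then be used as `employdat.sts[sampled_rows, ]`.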

Update

If clustering is the objective and you need a cluster membership for each individual sequence, see https://stackoverflow.com/a/63037549/1586731