Big Data Processing and Analysis in R

隐瞒了意图╮ 2021-02-01 08:36

I know this is not a new concept by any stretch in R, and I have browsed the High-Performance and Parallel Computing Task View. With that said, I am asking this question from a

4 Answers
  •  自闭症患者 2021-02-01 09:11

    If you need to operate on the entire 10GB file at once, then I second @Chase's point about getting a larger, possibly cloud-based computer.

    (The Twitter streaming API returns a fairly rich object: a single 140-character tweet can weigh in at a couple of kilobytes of data. You can reduce memory overhead by preprocessing the data outside of R to extract only the content you need, such as the author name and tweet text.)
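    Even if you keep things in R, you can get much of that reduction by streaming the file rather than parsing it all at once. A minimal sketch of the idea, assuming the jsonlite package, a hypothetical tweets.json file with one JSON object per line, and the standard Twitter field layout:

        library(jsonlite)

        # Collect only the fields we care about, one page of tweets at a
        # time, instead of loading the whole multi-GB file into memory.
        slim <- list()
        stream_in(
          file("tweets.json"),  # hypothetical: one JSON tweet per line
          handler = function(page) {
            slim[[length(slim) + 1]] <<- data.frame(
              author = page$user$screen_name,  # assumes Twitter's schema
              text   = page$text,
              stringsAsFactors = FALSE
            )
          },
          pagesize = 10000
        )
        tweets <- do.call(rbind, slim)

    The resulting tweets data frame carries two small columns rather than the full API payload, which is often enough to make a 10GB raw dump fit comfortably in memory.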

    On the other hand, if your analysis is amenable to segmenting the data -- for example, if you first want to group the tweets by author, date/time, etc. -- you could consider using Hadoop to drive R.

    Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.
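
    To see what the MapReduce model asks of you before committing to a cluster, here is the group-tweets-by-author pattern written as local map, shuffle, and reduce steps in plain R; Hadoop's job is to run exactly these steps distributed across machines. The tiny tweets data frame is made up for illustration:

        # Map: emit one (key, value) pair per record -- here (author, 1).
        tweets <- data.frame(author = c("a", "b", "a"),
                             text   = c("hi", "hello", "bye"),
                             stringsAsFactors = FALSE)
        pairs <- lapply(seq_len(nrow(tweets)),
                        function(i) list(key = tweets$author[i], value = 1L))

        # Shuffle: group all values that share a key.
        grouped <- split(vapply(pairs, `[[`, integer(1), "value"),
                         vapply(pairs, `[[`, character(1), "key"))

        # Reduce: combine each group -- here, count tweets per author.
        counts <- vapply(grouped, sum, integer(1))
        counts  # a: 2, b: 1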

    A couple of pointers:

    • an example in Chapter 7 of Parallel R shows how to set up R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work (see the sketch after this list).

    • you can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.
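
    For a feel of what a RHIPE version of such a job looks like, here is a schematic tweets-per-author count. This is a sketch, not code from the book: the HDFS paths are hypothetical, the stored tweet layout is an assumption, and a real job needs a configured Hadoop cluster behind it:

        library(Rhipe)
        rhinit()  # connect the R session to the Hadoop cluster

        # Map: for each tweet, emit (author, 1).
        map <- expression({
          lapply(map.values, function(tweet) {
            rhcollect(tweet$user$screen_name, 1L)  # assumed field layout
          })
        })

        # Reduce: sum the counts for each author.
        reduce <- expression(
          pre    = { total <- 0L },
          reduce = { total <- total + sum(unlist(reduce.values)) },
          post   = { rhcollect(reduce.key, total) }
        )

        # Hypothetical HDFS input/output paths.
        rhwatch(map = map, reduce = reduce,
                input = "/user/me/tweets", output = "/user/me/tweets-by-author")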
