How to use external data with Elastic MapReduce

Question

From Amazon's EMR FAQ:

Q: Can I load my data from the internet or somewhere other than Amazon S3?

Yes. Your Hadoop application can load the data from anywhere on the internet or from other AWS services. Note that if you load data from the internet, EC2 bandwidth charges will apply. Amazon Elastic MapReduce also provides Hive-based access to data in DynamoDB.

What are the specifications for loading data from external (non-S3) sources? There seems to be a dearth of resources on this option, and it does not appear to be documented in any form.

Answer 1:

If you want to do it "the Hadoop way", you should either implement a DFS over your data source or put references to your source URLs into a file that serves as the input for the MR job.
At the same time, Hadoop is about moving code to the data. Even EMR over S3 is not ideal from this perspective: EC2 and S3 are different clusters. So it is hard to imagine efficient MR processing if the data source is physically outside the data center.
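The second option above, a file of source URLs used as job input, could look roughly like the following mapper. This is only a minimal sketch, assuming the job runs through Hadoop Streaming with a Python mapper and that each input line holds exactly one URL. The names fetch, map_urls, and main are hypothetical, and counting lines stands in for whatever per-record work the real job would do.

```python
#!/usr/bin/env python3
"""Hadoop Streaming mapper sketch: each input line is one URL to fetch."""
import sys
import urllib.request


def fetch(url, timeout=30):
    """Download one URL and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def map_urls(lines, fetcher=fetch):
    """Yield (url, line_count) for every non-empty URL line."""
    for line in lines:
        url = line.strip()
        if not url:
            continue
        try:
            body = fetcher(url)
        except (OSError, ValueError) as exc:
            # URLError subclasses OSError; a malformed URL raises ValueError.
            # Report the failure on stderr and keep processing other URLs.
            print(f"skip {url}: {exc}", file=sys.stderr)
            continue
        # Placeholder "work": count the lines in the downloaded body.
        yield url, len(body.splitlines())


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Streaming entry point: URLs in on stdin, tab-separated pairs out."""
    for url, count in map_urls(stdin):
        stdout.write(f"{url}\t{count}\n")
```

To use it as a streaming mapper, the script would end with a call to main(). Every download leaves the data center, so the bandwidth charges and locality concerns mentioned above apply to each map task.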

Answer 2:

Basically, what Amazon is saying is that you can programmatically access any content from the internet or any other source from your own code. For example, you can access a Couch database instance through any HTTP-based client API.
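As an illustration of the Couch example, here is a rough sketch that reads documents over plain HTTP with only the Python standard library. It assumes the Couch database is Apache CouchDB and uses its _all_docs endpoint with include_docs=true. The function names, the host, and the database name in the usage note are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def all_docs_url(base_url, database):
    """Build the CouchDB _all_docs URL with document bodies included."""
    db = urllib.parse.quote(database, safe="")
    return f"{base_url.rstrip('/')}/{db}/_all_docs?include_docs=true"


def parse_docs(payload):
    """Extract the document bodies from an _all_docs JSON response."""
    rows = json.loads(payload)["rows"]
    return [row["doc"] for row in rows if "doc" in row]


def read_documents(base_url, database, timeout=30):
    """Fetch every document of one database in a single HTTP request."""
    url = all_docs_url(base_url, database)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_docs(resp.read().decode("utf-8"))
```

A map task could then call something like read_documents("http://couch.example.com:5984", "events") and process the returned list. For a large database, a single request like this would not scale, and paging through the results would be the more realistic choice.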

Answer 3:

I know that the Cassandra package for Java has a source package named org.apache.cassandra.hadoop, and it contains two classes that you need for reading data from Cassandra when running on AWS Elastic MapReduce.

Essential classes: ColumnFamilyInputFormat.java and ConfigHelper.java

Go to this link to see an example of what I'm talking about.

Source: https://stackoverflow.com/questions/10918450/how-to-use-external-data-with-elastic-mapreduce

Tags

elastic-map-reduce