The context of this question is that I was trying to use the MaxMind Java API in a Pig script that I had written... I do not think that knowing about either is necessary to follow this answer.
Here's how we use the MaxMind GeoIP lookup:
We put the GeoIPCity.dat file on the cloud (HDFS) and use that location as an argument when we launch the process.
The code where we get the GeoIPCity.dat file and create a new LookupService is:
// Runs once per mapper, typically in setup(): find GeoIPCity.dat among the
// files Hadoop has pulled into the local distributed cache.
if (DistributedCache.getLocalCacheFiles(context.getConfiguration()) != null) {
    List<Path> localFiles = Utility.arrayToList(DistributedCache.getLocalCacheFiles(context.getConfiguration()));
    for (Path localFile : localFiles) {
        if ((localFile.getName() != null) && (localFile.getName().equalsIgnoreCase("GeoIPCity.dat"))) {
            m_geoipLookupService = new LookupService(new File(localFile.toUri().getPath()));
        }
    }
}
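Once the LookupService exists, each map call can use it directly. Here's a minimal sketch of what that might look like; the map signature and the one-IP-per-line input format are assumptions, but the Location fields are from the legacy com.maxmind.geoip API:

// Hypothetical map(): assumes each input line is a single IP address
@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    Location loc = m_geoipLookupService.getLocation(value.toString());
    if (loc != null) {
        // Location exposes public fields in the legacy API
        context.write(new Text(value.toString()),
                new Text(loc.countryName + "\t" + loc.region + "\t" + loc.city));
    }
}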
Here is an abbreviated version of the command we use to run our process:
$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat -libjars /usr/lib/COMPANY/analytics/libjars/geoiplookup.jar
The critical components of this for running the MaxMind component are the -files and -libjars arguments. These are generic options handled by the GenericOptionsParser:
-files
-libjars
I'm assuming that Hadoop uses the GenericOptionsParser because I can find no reference to it anywhere in my project. :)
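For what it's worth, the generic options are only parsed when the job's entry point goes through ToolRunner (or calls GenericOptionsParser itself). A minimal sketch of what such an entry point typically looks like, with a made-up job class name:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class GeoIpJob extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // getConf() already reflects -files and -libjars by the time run() is called
        Job job = new Job(getConf(), "geoip lookup");
        job.setJarByClass(GeoIpJob.class);
        // ... set mapper, input/output paths, etc. ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner invokes the GenericOptionsParser, which consumes -files/-libjars
        System.exit(ToolRunner.run(new GeoIpJob(), args));
    }
}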
If you put the GeoIPCity.dat on the cloud and specify it using the -files argument, it will be put into the local cache, which the mapper can then read in its setup function. It doesn't have to happen in setup, but since it only needs to be done once per mapper, setup is an excellent place for it.
Then use the -libjars argument to specify the geoiplookup.jar (or whatever you've called yours) and the job will be able to use it. We don't put the geoiplookup.jar on the cloud; I'm rolling with the assumption that Hadoop will distribute the jar as it needs to.
I hope that all makes sense. I am getting fairly familiar with Hadoop/MapReduce, but I didn't write the pieces that use the MaxMind GeoIP component in the project, so I've had to do a little digging to understand it well enough to write the explanation here.
EDIT: Additional description of the -files and -libjars options
-files The -files argument is used to distribute files through the Hadoop Distributed Cache. In the example above, we are distributing the MaxMind GeoIP data file through the Hadoop Distributed Cache. We need access to this file to map a user's IP address to the appropriate country, region, city, and timezone. The API requires that the data file be present locally, which is not feasible in a distributed processing environment (we are not guaranteed which nodes in the cluster will process the data), so we use the Hadoop Distributed Cache to get the appropriate data to each processing node. The GenericOptionsParser and the ToolRunner automatically facilitate this via the -files argument. Please note that the file we distribute should already be available on the cloud (HDFS).
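If you'd rather not rely on the command line, the same thing can be done programmatically. A sketch, assuming it runs inside something like the run() method shown earlier, before job submission (imports: java.net.URI, org.apache.hadoop.conf.Configuration, org.apache.hadoop.filecache.DistributedCache):

// Programmatic equivalent of -files: register the HDFS file for
// distribution to every task node (pre-Hadoop-2 DistributedCache API).
Configuration conf = job.getConfiguration();
DistributedCache.addCacheFile(
        new URI("hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat"),
        conf);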
-libjars The -libjars argument is used to distribute any additional dependencies required by the map-reduce jobs. Like the data file, we also need to copy the dependent libraries to the nodes in the cluster where the job will run. The GenericOptionsParser and the ToolRunner automatically facilitate this via the -libjars argument.
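-libjars also has a programmatic counterpart, with the caveat that the jar must already be on HDFS (unlike -libjars, which uploads a local jar for you). A sketch, continuing from the snippet above; the HDFS path here is hypothetical:

// Programmatic equivalent of -libjars: put an HDFS-resident jar on the
// task classpath (hypothetical HDFS path).
DistributedCache.addFileToClassPath(
        new Path("/data/libjars/geoiplookup.jar"), conf);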