Using HBase importtsv tool to bulk load data from Java code

Posted by 与世无争的帅哥 on 2019-12-12 05:11:45

Question


I am trying to bulk-load a CSV file into HBase using the importtsv and LoadIncrementalHFiles tools that ship with Apache HBase.

Tutorials for these tools can be found on these pages: cloudera, apache

I am using Apache Hadoop and HBase.

Both sources explain how to use these tools through the command prompt. However, I want to get this done from Java code. I know I can write a custom MapReduce job as explained on the Cloudera page, but I want to know if I can use the classes corresponding to these tools directly in my Java code.

My cluster is running on an Ubuntu VM inside VMWare in pseudo-distributed mode, whereas my Java code is running on the Windows host machine. When doing it through the command prompt on the same machine running the cluster, we run the following command:

$HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-1.2.1.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=hdfs://192.168.23.128:9000/bulkloadoutputdir datatsv  hdfs://192.168.23.128:9000/bulkloadinputdir/

As can be seen above, we set HADOOP_CLASSPATH. In my case, I guess I have to copy all the xyz-site.xml Hadoop configuration files to my Windows machine and set the directory containing them as the HADOOP_CLASSPATH environment variable. So I copied core-site.xml, hbase-site.xml and hdfs-site.xml to my Windows machine and set that directory as the Windows environment variable HADOOP_CLASSPATH. Apart from this, I also added all the required JARs to the Eclipse project's build path.
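For reference, what I did on Windows amounts to roughly the following (C:\hadoop-conf is a made-up name for the directory holding the copied *-site.xml files):

```shell
rem C:\hadoop-conf is a hypothetical directory containing the copied
rem core-site.xml, hdfs-site.xml and hbase-site.xml from the cluster VM
set HADOOP_CLASSPATH=C:\hadoop-conf
```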

But after running the project I got the following error:

Exception in thread "main" org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the locations
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:319)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:156)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
    at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:326)
    at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:301)
    at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:166)
    at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:161)
    at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:794)
    at org.apache.hadoop.hbase.MetaTableAccessor.fullScan(MetaTableAccessor.java:602)
    at org.apache.hadoop.hbase.MetaTableAccessor.tableExists(MetaTableAccessor.java:366)
    at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:403)
    at org.apache.hadoop.hbase.mapreduce.ImportTsv.createSubmittableJob(ImportTsv.java:493)
    at org.apache.hadoop.hbase.mapreduce.ImportTsv.run(ImportTsv.java:737)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.hbase.mapreduce.ImportTsv.main(ImportTsv.java:747)
    at HBaseImportTsvBulkLoader.createStoreFilesFromHdfsFiles(HBaseImportTsvBulkLoader.java:36)
    at HBaseImportTsvBulkLoader.main(HBaseImportTsvBulkLoader.java:17)

So somehow importtsv is still not able to find the location of the cluster.
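One thing I have not verified is whether the HBase client is actually picking up hbase-site.xml from the classpath at all. A minimal check I could run (assuming my hbase-site.xml sets hbase.zookeeper.quorum) would be something like:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CheckHBaseConf {
    public static void main(String[] args) {
        // HBaseConfiguration.create() loads hbase-default.xml and,
        // if it is on the classpath, hbase-site.xml on top of it
        Configuration conf = HBaseConfiguration.create();
        // If this prints the default "localhost" instead of the VM's
        // address, hbase-site.xml is not being found on the classpath
        System.out.println(conf.get("hbase.zookeeper.quorum"));
    }
}
```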

This is what my basic code looks like:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.mapreduce.ImportTsv;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class HBaseImportTsvBulkLoader {
    static Configuration config;

    public static void main(String[] args) throws Exception {
        config = new Configuration();
        copyFileToHDFS();
        createStoreFilesFromHdfsFiles();
        loadStoreFilesToTable();
    }

    // Copy the local input file from the Windows host into HDFS on the VM
    private static void copyFileToHDFS() throws IOException {
        config.set("fs.defaultFS", "hdfs://192.168.23.128:9000");
        FileSystem hdfs = FileSystem.get(config);
        Path localfsSourceDir = new Path("D:\\delete\\bulkloadinputfile1");
        Path hdfsTargetDir = new Path(hdfs.getWorkingDirectory() + "/");
        hdfs.copyFromLocalFile(localfsSourceDir, hdfsTargetDir);
    }

    // Run ImportTsv to generate HFiles (store files) from the TSV input
    private static void createStoreFilesFromHdfsFiles() throws Exception {
        String[] _args = {"-Dimporttsv.bulk.output=hdfs://192.168.23.128:9000/bulkloadoutputdir",
                "-Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2",
                "datatsv",
                "hdfs://192.168.23.128:9000/bulkloadinputdir/"};
        ImportTsv.main(_args);                                 // **throws exception**
    }

    // Load the generated HFiles into the "datatsv" table
    private static void loadStoreFilesToTable() throws Exception {
        String[] _args = {"hdfs://192.168.23.128:9000/hbasebulkloadoutputdir", "datatsv"};
        LoadIncrementalHFiles.main(_args);
    }
}

Questions

  1. Which xyz-site.xml files are required?

  2. In what way should I be specifying HADOOP_CLASSPATH?

  3. Can I pass the required arguments to the main() methods of ImportTsv, such as -Dhbase.rootdir, as below:

    String[] _args = {"-Dimporttsv.bulk.output=hdfs://192.168.23.128:9000/bulkloadoutputdir",
            "-Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2",
            "-Dhbase.rootdir=hdfs://192.168.23.128:9000/hbase",
            "datatsv",
            "hdfs://192.168.23.128:9000/bulkloadinputdir/"};
    
  4. Can I use ImportTsv.setConf() to set the same?
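What I have in mind for questions 3 and 4 is a sketch like the following, which bypasses ImportTsv.main() and drives the tool through ToolRunner with an explicit Configuration (ImportTsv implements Tool, so ToolRunner hands it the Configuration via setConf; the addresses and paths are just my own setup, and the configuration keys mirror the -D options above):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.ImportTsv;
import org.apache.hadoop.util.ToolRunner;

public class ImportTsvDriver {
    public static void main(String[] args) throws Exception {
        // Start from HBaseConfiguration so hbase-default.xml (and, if
        // found, hbase-site.xml) are loaded, then override everything
        // the Windows client cannot discover on its own
        Configuration conf = HBaseConfiguration.create();
        conf.set("fs.defaultFS", "hdfs://192.168.23.128:9000");
        conf.set("hbase.zookeeper.quorum", "192.168.23.128"); // assumption: VM address
        conf.set("hbase.rootdir", "hdfs://192.168.23.128:9000/hbase");
        // Same keys that -Dimporttsv.columns / -Dimporttsv.bulk.output set
        conf.set("importtsv.columns", "HBASE_ROW_KEY,d:c1,d:c2");
        conf.set("importtsv.bulk.output", "hdfs://192.168.23.128:9000/bulkloadoutputdir");

        // ToolRunner calls setConf(conf) on the ImportTsv instance,
        // then run() with the remaining table/input arguments
        String[] tsvArgs = {"datatsv", "hdfs://192.168.23.128:9000/bulkloadinputdir/"};
        int exitCode = ToolRunner.run(conf, new ImportTsv(), tsvArgs);
        System.exit(exitCode);
    }
}
```

I have not confirmed this works across the Windows/VM boundary; it is just the pattern I would expect from the Tool/ToolRunner contract.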

Source: https://stackoverflow.com/questions/37548316/using-hbase-importtsv-tool-to-bulk-load-data-from-java-code
