Question
I'm attempting to read data from a kerberized HBase instance from Spark, using the Hortonworks SPARK-ON-HBASE connector. My cluster configuration essentially looks like this: I submit my Spark jobs from a client machine to a remote Spark standalone cluster, and the job attempts to read data from a separate HBase cluster.
If I bypass the standalone cluster by running Spark with master=local[*] directly on my client, I can access the remote HBase cluster without a problem, as long as I first kinit from the client. However, when I set my master to the remote cluster, with all other configs the same, I receive a NullPointerException at org.apache.hadoop.hbase.security.UserProvider.instantiate(UserProvider.java:43)
(full stack trace below)
Has anyone built a similar architecture who can perhaps lend some insight? Although the logs say nothing about authentication, I have a hunch that the job is failing to authenticate to HBase when it runs from the non-kerberized Spark cluster.
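For context, the read itself follows the standard SHC DataFrame pattern. Below is a minimal sketch of the kind of call that produces the trace (the table name and catalog JSON are placeholder examples, not my real schema; the py4j frames in the trace come from driving the same call through PySpark, shown here as the equivalent Java):

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.DataFrame;

// Placeholder SHC catalog for a hypothetical table "my_table" with one
// column family "cf1"; substitute your own schema.
String catalog = "{\"table\":{\"namespace\":\"default\", \"name\":\"my_table\"},"
    + "\"rowkey\":\"key\","
    + "\"columns\":{"
    + "\"key\":{\"cf\":\"rowkey\", \"col\":\"key\", \"type\":\"string\"},"
    + "\"value\":{\"cf\":\"cf1\", \"col\":\"value\", \"type\":\"string\"}}}";

Map<String, String> options = new HashMap<>();
options.put("catalog", catalog); // option key defined by HBaseTableCatalog.tableCatalog

// sqlContext is the job's existing SQLContext (Spark 1.6-era API).
DataFrame df = sqlContext.read()
    .options(options)
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load();
df.show(); // the NullPointerException surfaces here, inside the executors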
Full stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0: java.lang.NullPointerException
at org.apache.hadoop.hbase.security.UserProvider.instantiate(UserProvider.java:43)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:214)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
at org.apache.spark.sql.execution.datasources.hbase.TableResource.init(HBaseResources.scala:125)
at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.liftedTree1$1(HBaseResources.scala:57)
at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.acquire(HBaseResources.scala:54)
at org.apache.spark.sql.execution.datasources.hbase.TableResource.acquire(HBaseResources.scala:120)
at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.releaseOnException(HBaseResources.scala:74)
at org.apache.spark.sql.execution.datasources.hbase.TableResource.releaseOnException(HBaseResources.scala:120)
at org.apache.spark.sql.execution.datasources.hbase.TableResource.getScanner(HBaseResources.scala:144)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anonfun$7.apply(HBaseTableScan.scala:267)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anonfun$7.apply(HBaseTableScan.scala:266)
at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:165)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.security.UserProvider.instantiate(UserProvider.java:43)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:214)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
at org.apache.spark.sql.execution.datasources.hbase.TableResource.init(HBaseResources.scala:125)
at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.liftedTree1$1(HBaseResources.scala:57)
at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.acquire(HBaseResources.scala:54)
at org.apache.spark.sql.execution.datasources.hbase.TableResource.acquire(HBaseResources.scala:120)
at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.releaseOnException(HBaseResources.scala:74)
at org.apache.spark.sql.execution.datasources.hbase.TableResource.releaseOnException(HBaseResources.scala:120)
at org.apache.spark.sql.execution.datasources.hbase.TableResource.getScanner(HBaseResources.scala:144)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anonfun$7.apply(HBaseTableScan.scala:267)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anonfun$7.apply(HBaseTableScan.scala:266)
at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:165)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
Answer 1:
You have a configuration problem: the hbase.client.userprovider.class configuration is not available where the connection is created. You will need to make sure the HBase libs and conf files are on the classpath of your Spark executors (see the sketch after the snippet below). For reference, here is the method that throws:
private static final String USER_PROVIDER_CONF_KEY = "hbase.client.userprovider.class";

/**
 * Instantiate the {@link UserProvider} specified in the configuration and set the passed
 * configuration via {@link UserProvider#setConf(Configuration)}
 * @param conf to read and set on the created {@link UserProvider}
 * @return a {@link UserProvider} ready for use.
 */
public static UserProvider instantiate(Configuration conf) {
  // UserProvider.java:43 is the conf.getClass(...) lookup below, which
  // dereferences conf -- the NPE in the trace means conf arrived null.
  Class<? extends UserProvider> clazz =
      conf.getClass(USER_PROVIDER_CONF_KEY, UserProvider.class, UserProvider.class);
  return ReflectionUtils.newInstance(clazz, conf);
}
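Since line 43 is that conf.getClass(...) lookup, the NullPointerException means the Configuration object itself arrived null in the executor, which is consistent with the HBase conf and JARs never reaching the executor classpath. A minimal sketch of wiring in the executor classpath when building the context (the paths here are assumptions for a typical installation, not something from the question):

import org.apache.spark.SparkConf;

// Sketch only: /etc/hbase/conf and /opt/hbase/lib are assumed locations;
// point these at wherever the HBase conf and client JARs live on the workers.
SparkConf sparkConf = new SparkConf()
    .setAppName("hbase-read")
    // hbase-site.xml and the HBase client JARs must be visible to the
    // executor JVMs, not just to the driver.
    .set("spark.executor.extraClassPath", "/etc/hbase/conf:/opt/hbase/lib/*");
// The driver-side equivalent, spark.driver.extraClassPath, must be set before
// the driver JVM starts (spark-submit --driver-class-path or
// spark-defaults.conf), so it cannot be set programmatically here.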
Answer 2:
I stumbled on that symptom (but the root cause may not be the same) and found a very dirty workaround that you may not want to try.
$$ Context $$ Cloudera distro, HBase 1.2.0-CDH5.7.0
$$ Issue #1 $$ Some code clean-ups in the Apache / Hortonworks distros have not been ported to the Cloudera distro, e.g.
java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Scan.setCaching(I)Lorg/apache/hadoop/hbase/client/Scan;
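To make the failure mode concrete, here is a sketch of the mismatch (my illustration, not code from the post). SHC is compiled against Apache HBase 1.1.2, where the Scan setters are fluent, so the compiled call site expects setCaching(int) to return a Scan; if the client JAR on the cluster still carries an older void-returning signature, the JVM fails to link that exact descriptor at runtime:

import org.apache.hadoop.hbase.client.Scan;

// Compiled against Apache HBase 1.1.2, where setCaching returns Scan:
Scan scan = new Scan().setCaching(1000);
// The bytecode references setCaching(I)Lorg/apache/hadoop/hbase/client/Scan;.
// Run it against a client JAR whose setCaching(int) returns void and the JVM
// cannot resolve that signature -> java.lang.NoSuchMethodError, as above.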
$$ Workaround #1 $$
- Download the HBase client JARs from the Horton repo -- specifically "client", "common" and "protocol" -- for version 1.1.2 (that's the dependency shown in the POM for the Spark-HBase module).
- Add these JARs (and the directory /etc/hbase/conf/) to spark.driver.extraClassPath, along with the Spark-HBase JAR.
- Ship these JARs to the executors via the command-line option --jars, along with the Spark-HBase JAR (and don't forget the directory /etc/hbase/conf/ in spark.executor.extraClassPath, if the conf is present on all YARN nodes; otherwise find a way to ship the XML to a directory in their container CLASSPATH).
$$ Issue #2 $$ Somehow, in YARN mode, the Spark executors do not correctly generate the HBase configuration that is passed to the methods org.apache.hadoop.hbase.security.UserProvider.instantiate(Configuration) and org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(Configuration, boolean, ExecutorService, User), hence the java.lang.NullPointerException.
$$ Workaround #2 $$
- Download the HBase source code from GitHub, branch 1.1, for these two classes.
- Patch the code to make sure that whenever the conf argument is null, it is silently replaced with a call to org.apache.hadoop.hbase.HBaseConfiguration.create().
- Compile both classes, and replace the original .class files in the appropriate JARs with your patched versions.
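For illustration, here is roughly what that patch looks like in UserProvider (my reconstruction of the workaround described above, not the exact diff, and emphatically not stock HBase code):

// Patched UserProvider.instantiate per Workaround #2 (reconstruction).
public static UserProvider instantiate(Configuration conf) {
  if (conf == null) {
    // Silently fall back to a default client configuration instead of
    // letting conf.getClass(...) throw the NullPointerException.
    conf = org.apache.hadoop.hbase.HBaseConfiguration.create();
  }
  Class<? extends UserProvider> clazz =
      conf.getClass(USER_PROVIDER_CONF_KEY, UserProvider.class, UserProvider.class);
  return ReflectionUtils.newInstance(clazz, conf);
}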
It would certainly make more sense to patch the Spark-HBase plug-in (cf. the comment from ray3888 on that post), but Scala makes me puke, so I stick to plain old Java.
Source: https://stackoverflow.com/questions/39085959/accessing-a-kererized-remote-hbase-cluster-from-spark