How to create an Azure on-demand HDInsight Spark cluster using Data Factory


Question


I am trying to use Azure Data Factory to create an on-demand HDInsight Spark cluster using HDI version 3.5. Data Factory refuses to create it and fails with the error message:

HdiVersion:'3.5' is not supported

If there is currently no way of creating an on-demand HDInsight Spark cluster, then what is the sensible alternative? It seems very strange to me that Microsoft hasn't added an on-demand HDInsight Spark cluster option to Azure Data Factory.


Answer 1:


Here is a full solution, which uses ADF to schedule a custom .NET activity in C#. The activity in turn uses ARM templates to create the cluster and SSH.NET to execute the command that runs the R script.

So ADF is used to schedule the .NET activity, the Batch service is used to run the code in the DLL, and the JSON template file for the HDInsight cluster is stored in blob storage, where it can be configured as needed.

The full description is in the article "Automating Azure: Creating an On-Demand HDInsight Cluster", but here is the C# code, which is the essence of the automation (everything else is just admin work to set up the bits):

using System;
using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;
using Microsoft.Azure.Management.ResourceManager.Fluent;
using Microsoft.Azure.Management.ResourceManager.Fluent.Core;
using Renci.SshNet;

namespace VM
{
    public class StartVM : IDotNetActivity
    {
        private IActivityLogger _logger;

        public IDictionary<string, string> Execute(
            IEnumerable<LinkedService> linkedServices,
            IEnumerable<Dataset> datasets,
            Activity activity,
            IActivityLogger logger)
        {
            _logger = logger;
            _logger.Write("Starting execution...");

            // Authenticate with a service principal.
            var credentials = SdkContext.AzureCredentialsFactory.FromServicePrincipal(
                ""   // enter clientId here, this is the ApplicationID
                , "" // this is the Application secret key
                , "" // this is the tenant id
                , AzureEnvironment.AzureGlobalCloud);
            var azure = Microsoft.Azure.Management.Fluent.Azure
                .Configure()
                .WithLogLevel(HttpLoggingDelegatingHandler.Level.Basic)
                .Authenticate(credentials)
                .WithDefaultSubscription();

            var groupName = "myResourceGroup";
            var location = Region.EuropeNorth;

            // Create the resource group that will hold the cluster.
            var resourceGroup = azure.ResourceGroups.Define(groupName)
                .WithRegion(location)
                .Create();

            // Deploy the HDInsight cluster from the ARM template stored in blob.
            var templatePath = "https://myblob.blob.core.windows.net/blobcontainer/myHDI_template.JSON";
            var paramPath = "https://myblob.blob.core.windows.net/blobcontainer/myHDI_parameters.JSON";
            var deployment = azure.Deployments.Define("myDeployment")
                .WithExistingResourceGroup(groupName)
                .WithTemplateLink(templatePath, "0.9.0.0") // make sure the version matches the file
                .WithParametersLink(paramPath, "1.0.0.0")  // make sure the version matches the file
                .WithMode(Microsoft.Azure.Management.ResourceManager.Fluent.Models.DeploymentMode.Incremental)
                .Create();
            _logger.Write("The cluster is ready...");

            // Run the R script over SSH, then tear everything down.
            executeSSHCommand();
            _logger.Write("The SSH command was executed...");

            _logger.Write("Deleting the cluster...");
            // Deleting the resource group removes the cluster and stops the billing.
            azure.ResourceGroups.DeleteByName(groupName);

            return new Dictionary<string, string>();
        }

        private void executeSSHCommand()
        {
            var connectionInfo = new ConnectionInfo("myhdi-ssh.azurehdinsight.net", "sshuser",
                new AuthenticationMethod[]
                {
                    // Password-based authentication
                    new PasswordAuthenticationMethod("sshuser", "Addso@1234523123"),
                });

            using (var sshclient = new SshClient(connectionInfo))
            {
                sshclient.Connect();
                // Copy the R script from blob storage to the head node, run it,
                // and copy the output back to blob storage.
                using (var cmd = sshclient.CreateCommand(
                    "hdfs dfs -copyToLocal \"wasbs:///rscript/test.R\"; " +
                    "env -i R CMD BATCH --no-save --no-restore \"test.R\"; " +
                    "hdfs dfs -copyFromLocal -f \"test-output.txt\" \"wasbs:///rscript/test-output.txt\""))
                {
                    cmd.Execute();
                }
                sshclient.Disconnect();
            }
        }
    }
}
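
Not part of the original article, but a minimal local harness along the following lines can help smoke-test the activity before zipping the DLL for the Batch service. ConsoleActivityLogger is a hypothetical stand-in for the logger that ADF normally injects, and the service principal fields in StartVM must be filled in first:

using System;
using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;

namespace VM
{
    // Console-backed logger standing in for the one the Batch service provides.
    public class ConsoleActivityLogger : IActivityLogger
    {
        public void Write(string format, params object[] args)
            => Console.WriteLine(format, args);
    }

    public static class Program
    {
        public static void Main()
        {
            // StartVM ignores its linked services, datasets and activity
            // arguments, so empty placeholders are enough for a local run.
            new StartVM().Execute(
                new List<LinkedService>(),
                new List<Dataset>(),
                new Activity(),
                new ConsoleActivityLogger());
        }
    }
}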

Good luck!

Feodor




Answer 2:


I'm afraid that on-demand Spark clusters aren't currently supported, but they are definitely on the roadmap. Please stay tuned.

As a workaround for now, you may try an ADF custom activity to create and delete the Spark cluster with your own code.




Answer 3:


Azure currently doesn't support on-demand HDInsight cluster creation for the Spark activity. Since you are asking for a workaround, here is what I do:

  1. Bring the HDInsight cluster up using PowerShell automation and scheduled runbooks; it takes about 20 minutes for HDI to be ready.
  2. Submit the Spark batch job from Data Factory, scheduled about 30 minutes after the HDI creation (a C# sketch of this step follows below).
  3. Delete the HDI cluster 30 minutes to 1 hour after the expected job finish time.

Lots of work for a simple task, I know, but it works for now.
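
For step 2, one option in the same spirit is to submit the Spark batch directly to the cluster's Livy endpoint, which HDInsight exposes at https://<clustername>.azurehdinsight.net/livy/batches. A minimal C# sketch, assuming a hypothetical cluster name, cluster login, and a job jar already uploaded to the cluster's default storage:

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

public static class LivySparkSubmit
{
    public static async Task Main()
    {
        using (var client = new HttpClient())
        {
            // Hypothetical cluster login; HDInsight protects Livy with the
            // cluster's HTTP (admin) credentials over basic auth.
            var token = Convert.ToBase64String(
                Encoding.ASCII.GetBytes("admin:MyClusterPassword1!"));
            client.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue("Basic", token);
            client.DefaultRequestHeaders.Add("X-Requested-By", "admin");

            // The jar is assumed to live in the cluster's default storage
            // account; className is the job's entry point.
            var body = "{ \"file\": \"wasbs:///jobs/mySparkJob.jar\", " +
                       "\"className\": \"com.example.MySparkJob\" }";

            var response = await client.PostAsync(
                "https://myhdi.azurehdinsight.net/livy/batches",
                new StringContent(body, Encoding.UTF8, "application/json"));
            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }
}

The response includes a batch id that can be polled at /livy/batches/{id}, so the deletion in step 3 can be triggered on actual job completion rather than a fixed delay.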



Source: https://stackoverflow.com/questions/43545182/how-to-create-azure-on-demand-hd-insight-spark-cluster-using-data-factory
