Question
I am trying to use Azure Data Factory to create an on-demand HDInsight Spark cluster with HDI version 3.5. The data factory refuses to create it and fails with the error message:
HdiVersion:'3.5' is not supported
If there is currently no way of creating an on-demand HDInsight Spark cluster, then what is the other sensible option? It seems very strange to me that Microsoft hasn't added an on-demand HDInsight Spark cluster to Azure Data Factory.
Answer 1:
Here is a full solution: it uses ADF to schedule a custom .NET activity written in C#, which in turn uses an ARM template to create the cluster and SSH.NET to execute the command that runs the R script.
So ADF is used to schedule the .NET activity, the Batch service is used to run the code in the DLL, and the JSON template file for the HDInsight cluster is stored in blob storage, where it can be configured as needed.
The full description is in the article "Automating Azure: Creating an On-Demand HDInsight Cluster", but here is the C# code which is the essence of the automation (everything else is just admin work to set up the bits):
using System;
using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;
using Microsoft.Azure.Management.ResourceManager.Fluent;
using Microsoft.Azure.Management.ResourceManager.Fluent.Core;
using Renci.SshNet;
namespace VM
{
    public class StartVM : IDotNetActivity
    {
        private IActivityLogger _logger;

        public IDictionary<string, string> Execute(
            IEnumerable<LinkedService> linkedServices,
            IEnumerable<Dataset> datasets,
            Activity activity,
            IActivityLogger logger)
        {
            _logger = logger;
            _logger.Write("Starting execution...");

            var credentials = SdkContext.AzureCredentialsFactory.FromServicePrincipal(
                ""   // enter clientId here, this is the ApplicationID
                , "" // this is the Application secret key
                , "" // this is the tenant id
                , AzureEnvironment.AzureGlobalCloud);

            var azure = Microsoft.Azure.Management.Fluent.Azure
                .Configure()
                .WithLogLevel(HttpLoggingDelegatingHandler.Level.Basic)
                .Authenticate(credentials)
                .WithDefaultSubscription();

            var groupName = "myResourceGroup";
            var location = Region.EuropeNorth;

            // create the resource group
            var resourceGroup = azure.ResourceGroups.Define(groupName)
                .WithRegion(location)
                .Create();

            // deploy the template
            var templatePath = "https://myblob.blob.core.windows.net/blobcontainer/myHDI_template.JSON";
            var paramPath = "https://myblob.blob.core.windows.net/blobcontainer/myHDI_parameters.JSON";
            var deployment = azure.Deployments.Define("myDeployment")
                .WithExistingResourceGroup(groupName)
                .WithTemplateLink(templatePath, "0.9.0.0")  // make sure it matches the file
                .WithParametersLink(paramPath, "1.0.0.0")   // make sure it matches the file
                .WithMode(Microsoft.Azure.Management.ResourceManager.Fluent.Models.DeploymentMode.Incremental)
                .Create();

            _logger.Write("The cluster is ready...");
            executeSSHCommand();
            _logger.Write("The SSH command was executed...");

            _logger.Write("Deleting the cluster...");
            // delete the resource group
            azure.ResourceGroups.DeleteByName(groupName);

            return new Dictionary<string, string>();
        }

        private void executeSSHCommand()
        {
            ConnectionInfo ConnNfo = new ConnectionInfo("myhdi-ssh.azurehdinsight.net", "sshuser",
                new AuthenticationMethod[]{
                    // Password based authentication
                    new PasswordAuthenticationMethod("sshuser", "Addso@1234523123"),
                }
            );

            // Execute a (SHELL) command: copy the R script from blob storage, run it, and copy the output back
            using (var sshclient = new SshClient(ConnNfo))
            {
                sshclient.Connect();
                using (var cmd = sshclient.CreateCommand(
                    "hdfs dfs -copyToLocal \"wasbs:///rscript/test.R\";env -i R CMD BATCH --no-save --no-restore \"test.R\"; hdfs dfs -copyFromLocal -f \"test-output.txt\" \"wasbs:///rscript/test-output.txt\" "))
                {
                    cmd.Execute();
                }
                sshclient.Disconnect();
            }
        }
    }
}
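One caveat about the listing: if the deployment or the SSH command throws, the method exits before azure.ResourceGroups.DeleteByName runs, so the on-demand cluster is left running (and billing). Below is a minimal sketch of how the body of Execute could be rearranged with try/finally so the cluster is always torn down; it reuses the azure, groupName, location, templatePath and paramPath variables exactly as defined above.

// create the resource group first
azure.ResourceGroups.Define(groupName)
    .WithRegion(location)
    .Create();

try
{
    // deploy the HDInsight cluster from the ARM template, as in the listing
    azure.Deployments.Define("myDeployment")
        .WithExistingResourceGroup(groupName)
        .WithTemplateLink(templatePath, "0.9.0.0")
        .WithParametersLink(paramPath, "1.0.0.0")
        .WithMode(Microsoft.Azure.Management.ResourceManager.Fluent.Models.DeploymentMode.Incremental)
        .Create();

    _logger.Write("The cluster is ready...");
    executeSSHCommand();
    _logger.Write("The SSH command was executed...");
}
finally
{
    // delete the resource group even if the deployment or the SSH step failed,
    // so the on-demand cluster does not keep running
    _logger.Write("Deleting the cluster...");
    azure.ResourceGroups.DeleteByName(groupName);
}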
Good luck!
Feodor
Answer 2:
I'm afraid that on-demand Spark clusters aren't currently supported, but this is definitely on the roadmap. Please stay tuned.
As a workaround for now, you may try an ADF CustomActivity to create and delete the Spark cluster with your own custom code.
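If you go the custom-activity route, the values hard-coded in the C# listing above (template and parameter URLs, resource group name) can instead be supplied from the pipeline definition and read inside Execute. A minimal sketch, assuming the ADF v1 .NET custom activity model where the activity's extendedProperties are exposed via DotNetActivity.ExtendedProperties; the keys templateLink, paramLink and resourceGroup are hypothetical names that would have to match whatever you put in the pipeline JSON:

// inside Execute(...), before authenticating:
// cast the activity's type properties to DotNetActivity to reach extendedProperties
var dotNetActivity = (DotNetActivity)activity.TypeProperties;
var props = dotNetActivity.ExtendedProperties;

var templatePath = props["templateLink"];   // hypothetical key, set in the pipeline JSON
var paramPath = props["paramLink"];         // hypothetical key, set in the pipeline JSON
var groupName = props["resourceGroup"];     // hypothetical key, set in the pipeline JSON

_logger.Write("Deploying {0} into resource group {1}", templatePath, groupName);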
Answer 3:
Azure currently doesn't support on-demand HDInsight cluster creation for the Spark activity. Since you are asking for a workaround, here is what I do:
- Bring the HDInsight cluster up using PowerShell automation and scheduling (runbooks); it takes about 20 minutes for HDI to be ready.
- Submit the Spark batch job from Data Factory, scheduled about 30 minutes after the HDI schedule.
- Delete the HDI cluster 30 minutes to 1 hour after the expected job finish time.
Lots of work for a simple task, I know, but it works for now.
Source: https://stackoverflow.com/questions/43545182/how-to-create-azure-on-demand-hd-insight-spark-cluster-using-data-factory