hdinsight

How to use Avro on HDInsight Spark/Jupyter?

限于喜欢 posted on 2020-01-24 03:39:10
Question: I am trying to read in an Avro file inside an HDInsight Spark/Jupyter cluster but got u'Failed to find data source: com.databricks.spark.avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;'
Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 159, in load
return self._df(self._jreader.load(path))
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in _
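The excerpt cuts off before any resolution, so here is a minimal PySpark sketch of one common workaround, assuming a Spark 2.x cluster: make the external spark-avro package available to the session and read through its data source. The package coordinates and the input path are illustrative assumptions, not values from the question.

# A minimal sketch, assuming Spark 2.x; package coordinates and path are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-avro-example")
    # Pull in the Databricks spark-avro data source; match the artifact to your
    # cluster's Scala/Spark versions.
    .config("spark.jars.packages", "com.databricks:spark-avro_2.11:4.0.0")
    .getOrCreate()
)

# Read using the data source name the error message says is missing.
df = spark.read.format("com.databricks.spark.avro").load(
    "wasb:///example/data/sample.avro"  # hypothetical path
)
df.printSchema()

In a Jupyter notebook on HDInsight the session is created by Livy before the first cell runs, so the same setting is usually supplied through the %%configure magic (a "conf" entry for spark.jars.packages) rather than on an already-running SparkSession.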

Backup of Data Lake Store

与世无争的帅哥 posted on 2020-01-16 19:46:07
Question: I am working on a backup strategy for Data Lake Store (DLS). My plan is to create two DLS accounts and copy data between them. I have evaluated several approaches to achieve this, but none of them satisfies the requirement to preserve the POSIX ACLs (permissions in DLS parlance). PowerShell cmdlets require data to be downloaded from the primary DLS onto a VM and re-uploaded onto the secondary DLS. The AdlCopy tool works only on Windows 10, does not preserve permissions, and neither supports

Error in running movie recommendations by using Apache Mahout with HDInsight

孤街浪徒 posted on 2020-01-16 19:40:15
Question: I ran the following code but am receiving an error...
# The HDInsight cluster name.
$clusterName = "my-cluster-name"
Use-AzureHDInsightCluster $clusterName
# NOTE: The version number portion of the file path
# may change in future versions of HDInsight.
# So dynamically grab it using Hive.
$mahoutPath = Invoke-Hive -Query '!${env:COMSPEC} /c dir /b /s ${env:MAHOUT_HOME}\examples\target\*-job.jar' | where {$_.startswith("C:\apps\dist")}
$mahoutPath = $mahoutPath -replace "\\", "/"
$jarFile =

insert into where not exists in hive

寵の児 posted on 2020-01-14 05:20:08
Question: I need the Hive syntax for this equivalent in ANSI SQL:
insert into tablea (id) select id from tableb where id not in (select id from tablea)
so tablea contains no duplicates and only new ids from tableb are inserted.
Answer 1: Use a left outer join with a filter that tableA.id is null:
insert into table tableA (id)
select b.id
from tableB b
left outer join tableA a on a.id = b.id
where a.id is null;
Source: https://stackoverflow.com/questions/20951703/insert-into-where-not-exists-in-hive
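For readers doing the same thing from Spark on HDInsight, here is a hedged PySpark sketch of the identical pattern; a left anti join is Spark's built-in equivalent of NOT EXISTS. The table names come from the question, while the session setup and write mode are assumptions.

# A sketch only: appends to tablea the ids from tableb that it does not already contain.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("insert-missing-ids")
    .enableHiveSupport()
    .getOrCreate()
)

tablea = spark.table("tablea")
tableb = spark.table("tableb")

# Rows of tableb whose id does not already appear in tablea.
new_ids = tableb.select("id").join(tablea.select("id"), on="id", how="left_anti")

# Append only the new ids.
new_ids.write.mode("append").insertInto("tablea")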

Not able to see 'Lifecycle management' option for ADLS Gen2

前提是你 posted on 2020-01-14 04:34:25
Question: I have created an ADLS (Azure Data Lake Storage) Gen2 resource (StorageV2 with hierarchical namespace enabled). The region I created the resource in is Central US, the performance/access tier is Standard/Hot, and replication is LRS. But for this resource I can't see the 'Lifecycle management' option on the portal. ADLS Gen2 is simply a StorageV2 account with hierarchical namespace enabled, and since the lifecycle management option exists for StorageV2 as per Microsoft documentation, it should be

Create external table with select from other table

别等时光非礼了梦想. posted on 2020-01-12 07:14:12
Question: I am using HDInsight and need to delete my clusters when I am finished running queries. However, I need the data I gather to survive for another day. I am working on queries that would create calculated columns from table1 and insert them into table2. First I wanted a simple test to copy the rows. Can you create an external table from a select statement?
drop table if exists table2;
create external table table2 as select * from table1
STORED AS TEXTFILE
LOCATION 'wasb://{container name}@
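The usual answer is that older Hive versions reject CREATE EXTERNAL TABLE ... AS SELECT, so the external table is created first with an explicit schema and location and then filled with an INSERT. Below is a hedged PySpark sketch of that two-step workaround; the single-column schema and the container and account names are placeholders, not values from the question.

# A sketch of the create-then-insert workaround, assuming a Hive metastore is
# reachable from the Spark session. Schema and storage names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("external-table-copy")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("DROP TABLE IF EXISTS table2")

# Create the external table up front with its schema and storage location ...
spark.sql("""
    CREATE EXTERNAL TABLE table2 (id STRING)
    STORED AS TEXTFILE
    LOCATION 'wasb://mycontainer@myaccount.blob.core.windows.net/table2'
""")

# ... then copy the rows into it, so the data outlives the cluster.
spark.sql("INSERT OVERWRITE TABLE table2 SELECT id FROM table1")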

spark-shell error: No FileSystem for scheme: wasb

我怕爱的太早我们不能终老 posted on 2020-01-10 20:10:10
Question: We have an HDInsight cluster running in Azure, but it doesn't allow us to spin up an edge/gateway node at the time of cluster creation. So I was creating this edge/gateway node by installing
echo 'deb http://private-repo-1.hortonworks.com/HDP/ubuntu14/2.x/updates/2.4.2.0 HDP main' >> /etc/apt/sources.list.d/HDP.list
echo 'deb http://private-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/ubuntu14 HDP-UTILS main' >> /etc/apt/sources.list.d/HDP.list
echo 'deb [arch=amd64] https://apt-mo.trafficmanager
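Since the excerpt stops mid-setup, here is a hedged PySpark sketch of the configuration that commonly resolves "No FileSystem for scheme: wasb" on a hand-built edge node: put the hadoop-azure and azure-storage jars on the classpath and register the wasb filesystem implementation. The jar locations, storage account, and container names below are assumptions for illustration only.

# A sketch only: jar paths, account and container names are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("wasb-access-example")
    # hadoop-azure and azure-storage provide NativeAzureFileSystem, the wasb:// driver.
    .config(
        "spark.jars",
        "/usr/hdp/current/hadoop-client/hadoop-azure.jar,"
        "/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar",
    )
    # Tell Hadoop which class handles the wasb scheme.
    .config("spark.hadoop.fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    # Credentials for the storage account backing the cluster.
    .config(
        "spark.hadoop.fs.azure.account.key.myaccount.blob.core.windows.net",
        "<storage-account-key>",
    )
    .getOrCreate()
)

df = spark.read.text("wasb://mycontainer@myaccount.blob.core.windows.net/example/data/sample.log")
df.show(5, truncate=False)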

Create HDCluster using PowerShell

孤街醉人 posted on 2020-01-06 06:53:07
Question: I am trying to create a cluster using PowerShell. Here is the script I am executing:
$containerName = "hdfiles"
$location = "Southeast Asia"
$clusterNodes = 2
$userName = "HDUser"
#Generate random password
$rand = New-Object System.Random
$pass = ""
$pass = $pass + [char]$rand.next(97,121) #lower case
$pass = $pass + [char]$rand.next(48,57) #number
$pass = $pass + [char]$rand.next(65,90) #upper case
$pass = $pass + [char]$rand.next(58,62) #special character
1..6 | ForEach { $pass = $pass + [char]

How to create a HDInsightOnDemand LinkedService with a script action in Data Factory?

纵饮孤独 posted on 2020-01-03 02:47:07
Question: We are creating a Data Factory for running a PySpark job that uses an HDInsight on-demand cluster. The problem is that we need additional Python dependencies for running this job, such as numpy, that are not installed. We believe that the way of doing so is configuring a Script Action for the HDInsightOnDemandLinkedService, but we cannot find this option in Data Factory or Linked Services. Is there an alternative for automating the installation of the dependencies on the on-demand cluster?
Answer 1:
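The answer text is cut off here. For orientation, Data Factory v2's on-demand HDInsight linked service accepts a scriptActions array in its typeProperties, which is the usual place to install extra Python packages such as numpy during provisioning. The sketch below is a Python dict mirroring the shape of such a definition; the names, the script URI, and the referenced storage linked service are assumptions, and the exact property schema should be checked against the current HDInsightOnDemand linked-service reference.

# A hedged sketch of an on-demand HDInsight linked service with a script action.
# All identifiers and URIs are hypothetical; verify field names against the
# current Data Factory (v2) schema before use.
on_demand_linked_service = {
    "name": "HDInsightOnDemandLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "clusterType": "spark",
            "clusterSize": 4,
            "timeToLive": "00:15:00",
            "linkedServiceName": {
                "referenceName": "AzureStorageLinkedService",  # hypothetical storage linked service
                "type": "LinkedServiceReference",
            },
            # Each script action runs while the on-demand cluster is being provisioned.
            "scriptActions": [
                {
                    "name": "installPythonDeps",  # hypothetical name
                    "uri": "https://mystore.blob.core.windows.net/scripts/install-deps.sh",  # hypothetical script
                    "roles": "headnode;workernode",
                    "parameters": "",
                }
            ],
        },
    },
}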