hdinsight

How to use Avro on HDInsight Spark/Jupyter?

限于喜欢 posted on 2020-01-24 03:39:10
Question: I am trying to read in an Avro file inside an HDInsight Spark/Jupyter cluster but got u'Failed to find data source: com.databricks.spark.avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;'
Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 159, in load
return self._df(self._jreader.load(path))
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in _
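The excerpt cuts off before any resolution, so here is a minimal PySpark sketch of one common workaround, assuming a Spark 2.x cluster: make the external spark-avro package available to the session and read through its data source. The package coordinates and the input path are illustrative assumptions, not values from the question.

# A minimal sketch, assuming Spark 2.x; package coordinates and path are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-avro-example")
    # Pull in the Databricks spark-avro data source; match the artifact to your
    # cluster's Scala/Spark versions.
    .config("spark.jars.packages", "com.databricks:spark-avro_2.11:4.0.0")
    .getOrCreate()
)

# Read using the data source name the error message says is missing.
df = spark.read.format("com.databricks.spark.avro").load(
    "wasb:///example/data/sample.avro"  # hypothetical path
)
df.printSchema()

In a Jupyter notebook on HDInsight the session is created by Livy before the first cell runs, so the same setting is usually supplied through the %%configure magic (a "conf" entry for spark.jars.packages) rather than on an already-running SparkSession.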

Backup of Data Lake Store

与世无争的帅哥 posted on 2020-01-16 19:46:07
Question: I am working on a backup strategy for Data Lake Store (DLS). My plan is to create two DLS accounts and copy data between them. I have evaluated several approaches to achieve this, but none of them satisfies the requirement to preserve the POSIX ACLs (permissions in DLS parlance). PowerShell cmdlets require data to be downloaded from the primary DLS onto a VM and re-uploaded onto the secondary DLS. The AdlCopy tool works only on Windows 10, does not preserve permissions, and neither supports

Error in running movie recommendations by using Apache Mahout with HDInsight

孤街浪徒 posted on 2020-01-16 19:40:15
Question: I ran the following code but am receiving an error...
# The HDInsight cluster name.
$clusterName = "my-cluster-name"
Use-AzureHDInsightCluster $clusterName
# NOTE: The version number portion of the file path
# may change in future versions of HDInsight.
# So dynamically grab it using Hive.
$mahoutPath = Invoke-Hive -Query '!${env:COMSPEC} /c dir /b /s ${env:MAHOUT_HOME}\examples\target\*-job.jar' | where {$_.startswith("C:\apps\dist")}
$mahoutPath = $mahoutPath -replace "\\", "/"
$jarFile =

insert into where not exists in hive

寵の児 posted on 2020-01-14 05:20:08
Question: I need the Hive syntax for this equivalent in ANSI SQL:
insert into tablea (id) select id from tableb where id not in (select id from tablea)
so tablea contains no duplicates and only new ids from tableb are inserted.
Answer 1: Use a left outer join with a filter that tableA.id is null:
insert into table tableA (id)
select b.id
from tableB b
left outer join tableA a on a.id = b.id
where a.id is null;
Source: https://stackoverflow.com/questions/20951703/insert-into-where-not-exists-in-hive
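For readers doing the same thing from Spark on HDInsight, here is a hedged PySpark sketch of the identical pattern; a left anti join is Spark's built-in equivalent of NOT EXISTS. The table names come from the question, while the session setup and write mode are assumptions.

# A sketch only: appends to tablea the ids from tableb that it does not already contain.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("insert-missing-ids")
    .enableHiveSupport()
    .getOrCreate()
)

tablea = spark.table("tablea")
tableb = spark.table("tableb")

# Rows of tableb whose id does not already appear in tablea.
new_ids = tableb.select("id").join(tablea.select("id"), on="id", how="left_anti")

# Append only the new ids.
new_ids.write.mode("append").insertInto("tablea")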

Not able to see 'Lifecycle management' option for ADLS Gen2

前提是你 posted on 2020-01-14 04:34:25
Question: I have created an ADLS (Azure Data Lake Storage) Gen2 resource (StorageV2 with hierarchical namespace enabled). The region I created the resource in is Central US, the performance/access tier is Standard/Hot, and replication is LRS. But for this resource I can't see the 'Lifecycle management' option on the portal. ADLS Gen2 is simply a StorageV2 account with hierarchical namespace enabled, and since the lifecycle management option exists for StorageV2 as per Microsoft documentation, it should be

Create external table with select from other table

别等时光非礼了梦想. posted on 2020-01-12 07:14:12
Question: I am using HDInsight and need to delete my clusters when I am finished running queries. However, I need the data I gather to survive for another day. I am working on queries that would create calculated columns from table1 and insert them into table2. First I wanted a simple test to copy the rows. Can you create an external table from a select statement?
drop table if exists table2;
create external table table2 as select * from table1
STORED AS TEXTFILE
LOCATION 'wasb://{container name}@
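The usual answer is that older Hive versions reject CREATE EXTERNAL TABLE ... AS SELECT, so the external table is created first with an explicit schema and location and then filled with an INSERT. Below is a hedged PySpark sketch of that two-step workaround; the single-column schema and the container and account names are placeholders, not values from the question.

# A sketch of the create-then-insert workaround, assuming a Hive metastore is
# reachable from the Spark session. Schema and storage names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("external-table-copy")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("DROP TABLE IF EXISTS table2")

# Create the external table up front with its schema and storage location ...
spark.sql("""
    CREATE EXTERNAL TABLE table2 (id STRING)
    STORED AS TEXTFILE
    LOCATION 'wasb://mycontainer@myaccount.blob.core.windows.net/table2'
""")

# ... then copy the rows into it, so the data outlives the cluster.
spark.sql("INSERT OVERWRITE TABLE table2 SELECT id FROM table1")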

spark-shell error: No FileSystem for scheme: wasb

我怕爱的太早我们不能终老 posted on 2020-01-10 20:10:10
Question: We have an HDInsight cluster running in Azure, but it doesn't allow us to spin up an edge/gateway node at the time of cluster creation. So I was creating this edge/gateway node by installing
echo 'deb http://private-repo-1.hortonworks.com/HDP/ubuntu14/2.x/updates/2.4.2.0 HDP main' >> /etc/apt/sources.list.d/HDP.list
echo 'deb http://private-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/ubuntu14 HDP-UTILS main' >> /etc/apt/sources.list.d/HDP.list
echo 'deb [arch=amd64] https://apt-mo.trafficmanager
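Since the excerpt stops mid-setup, here is a hedged PySpark sketch of the configuration that commonly resolves "No FileSystem for scheme: wasb" on a hand-built edge node: put the hadoop-azure and azure-storage jars on the classpath and register the wasb filesystem implementation. The jar locations, storage account, and container names below are assumptions for illustration only.

# A sketch only: jar paths, account and container names are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("wasb-access-example")
    # hadoop-azure and azure-storage provide NativeAzureFileSystem, the wasb:// driver.
    .config(
        "spark.jars",
        "/usr/hdp/current/hadoop-client/hadoop-azure.jar,"
        "/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar",
    )
    # Tell Hadoop which class handles the wasb scheme.
    .config("spark.hadoop.fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    # Credentials for the storage account backing the cluster.
    .config(
        "spark.hadoop.fs.azure.account.key.myaccount.blob.core.windows.net",
        "<storage-account-key>",
    )
    .getOrCreate()
)

df = spark.read.text("wasb://mycontainer@myaccount.blob.core.windows.net/example/data/sample.log")
df.show(5, truncate=False)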

Create HDCluster using PowerShell

孤街醉人 posted on 2020-01-06 06:53:07
Question: I am trying to create a cluster using PowerShell. Here is the script I am executing:
$containerName = "hdfiles"
$location = "Southeast Asia"
$clusterNodes = 2
$userName = "HDUser"
#Generate random password
$rand = New-Object System.Random
$pass = ""
$pass = $pass + [char]$rand.next(97,121) #lower case
$pass = $pass + [char]$rand.next(48,57) #number
$pass = $pass + [char]$rand.next(65,90) #upper case
$pass = $pass + [char]$rand.next(58,62) #special character
1..6 | ForEach { $pass = $pass + [char]

How to create a HDInsightOnDemand LinkedService with a script action in Data Factory?

纵饮孤独 posted on 2020-01-03 02:47:07
Question: We are creating a Data Factory for running a PySpark job that uses an HDInsight on-demand cluster. The problem is that we need additional Python dependencies for running this job, such as numpy, that are not installed. We believe that the way of doing so is configuring a Script Action for the HDInsightOnDemandLinkedService, but we cannot find this option in Data Factory or Linked Services. Is there an alternative for automating the installation of the dependencies on the on-demand cluster?
Answer 1:
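The answer text is cut off here. For orientation, Data Factory v2's on-demand HDInsight linked service accepts a scriptActions array in its typeProperties, which is the usual place to install extra Python packages such as numpy during provisioning. The sketch below is a Python dict mirroring the shape of such a definition; the names, the script URI, and the referenced storage linked service are assumptions, and the exact property schema should be checked against the current HDInsightOnDemand linked-service reference.

# A hedged sketch of an on-demand HDInsight linked service with a script action.
# All identifiers and URIs are hypothetical; verify field names against the
# current Data Factory (v2) schema before use.
on_demand_linked_service = {
    "name": "HDInsightOnDemandLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "clusterType": "spark",
            "clusterSize": 4,
            "timeToLive": "00:15:00",
            "linkedServiceName": {
                "referenceName": "AzureStorageLinkedService",  # hypothetical storage linked service
                "type": "LinkedServiceReference",
            },
            # Each script action runs while the on-demand cluster is being provisioned.
            "scriptActions": [
                {
                    "name": "installPythonDeps",  # hypothetical name
                    "uri": "https://mystore.blob.core.windows.net/scripts/install-deps.sh",  # hypothetical script
                    "roles": "headnode;workernode",
                    "parameters": "",
                }
            ],
        },
    },
}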