google-cloud-dataproc

Spark UI appears with wrong format (broken CSS)

若如初见. Submitted on 2020-06-11 05:26:09
Question: I am using Apache Spark for the first time. I run my application, and when I access localhost:4040 I get what is shown in the picture. I found that setting spark.ui.enabled to true might help, but I don't know how to do that. Thanks in advance. Answer 1: I have faced the same issue while using Spark on Google Cloud Dataproc. If you access the Spark job UI not through port 4040 directly but through the YARN Web UI (port 8088), you will see correctly rendered web pages. To work around this issue when
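The quoted workaround breaks off above. As a side note, here is a minimal sketch of how a property such as spark.ui.enabled (the setting the asker mentions) can be applied when building the session; the property name and value come from the question, and this is not a confirmed fix for the broken rendering, which the answer attributes to how the UI is reached.

    # Minimal sketch: setting a Spark property from PySpark code.
    # spark.ui.enabled defaults to true; shown only because the question
    # asks how such a setting is applied at all.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("ui-config-example")
        .config("spark.ui.enabled", "true")
        .getOrCreate()
    )
    print(spark.conf.get("spark.ui.enabled"))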

Submit a Python project to Dataproc job

☆樱花仙子☆ Submitted on 2020-06-08 19:15:48
Question: I have a Python project whose folder has the structure:

    main_directory
    - lib
      - lib.py
    - run
      - script.py

script.py is:

    from lib.lib import add_two
    spark = SparkSession \
        .builder \
        .master('yarn') \
        .appName('script') \
        .getOrCreate()
    print(add_two(1, 2))

and lib.py is:

    def add_two(x, y):
        return x + y

I want to launch this as a Dataproc job in GCP. I have checked online, but I have not understood well how to do it. I am trying to launch the script with gcloud dataproc jobs submit pyspark --cluster=
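The command above is cut off at the --cluster flag. A hedged sketch of one common approach, assuming the lib package is zipped and shipped with --py-files; the cluster name and paths below are placeholders, not the asker's actual values:

    # Assumed submission, run from main_directory:
    #   zip -r lib.zip lib
    #   gcloud dataproc jobs submit pyspark run/script.py \
    #       --cluster=<your-cluster> --py-files=lib.zip
    #
    # run/script.py can then stay essentially as written in the question:
    from pyspark.sql import SparkSession
    from lib.lib import add_two  # importable because lib.zip was shipped via --py-files

    spark = (
        SparkSession.builder
        .master('yarn')
        .appName('script')
        .getOrCreate()
    )
    print(add_two(1, 2))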

How to install python packages in a Google Dataproc cluster

廉价感情. Submitted on 2020-05-28 23:20:14
Question: Is it possible to install Python packages in a Google Dataproc cluster after the cluster is created and running? I tried to use "pip install xxxxxxx" on the master's command line, but it does not seem to work. Google's Dataproc documentation does not mention this situation. Answer 1: This is generally not possible after the cluster is created. I recommend using an initialization action to do this. As you've noticed, pip is also not available by default. So you'll want to run easy_install pip followed
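The answer breaks off after the easy_install hint. A hedged sketch of an initialization action, written here as an executable Python script (bash is more typical); the bootstrap steps follow the answer's hint, and the package names and GCS paths are placeholders, not a tested recipe:

    #!/usr/bin/env python
    # Runs on every node while the cluster is being created.
    import subprocess

    def run(cmd):
        # Fail loudly so a broken install surfaces as a cluster-creation error.
        subprocess.check_call(cmd, shell=True)

    run("easy_install pip")                    # pip is not present by default
    run("pip install --upgrade numpy pandas")  # placeholder package list

    # Staged to GCS and referenced at creation time, e.g.:
    #   gcloud dataproc clusters create my-cluster \
    #       --initialization-actions gs://my-bucket/install-packages.py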

How to connect with JMX remotely to Spark worker on Dataproc

◇◆丶佛笑我妖孽 Submitted on 2020-04-13 05:57:11
Question: I can connect to the driver just fine by adding the following:

    spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote \
      -Dcom.sun.management.jmxremote.port=9178 \
      -Dcom.sun.management.jmxremote.authenticate=false \
      -Dcom.sun.management.jmxremote.ssl=false

But doing ...

    spark.executor.extraJavaOptions=-Dcom.sun.management.jmxremote \
      -Dcom.sun.management.jmxremote.port=9178 \
      -Dcom.sun.management.jmxremote.authenticate=false \
      -Dcom.sun.management.jmxremote.ssl=false

... only yields a
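The answer is cut off above. One commonly suggested workaround (an assumption here, not the quoted answer) is to let each executor JVM pick a free port with port=0, since several executors can land on the same node and cannot all bind 9178:

    # Hedged sketch: executor JMX options with an ephemeral port.
    from pyspark.sql import SparkSession

    executor_jmx_opts = " ".join([
        "-Dcom.sun.management.jmxremote",
        "-Dcom.sun.management.jmxremote.port=0",  # 0 lets the JVM choose a free port
        "-Dcom.sun.management.jmxremote.authenticate=false",
        "-Dcom.sun.management.jmxremote.ssl=false",
    ])

    spark = (
        SparkSession.builder
        .appName("executor-jmx-example")
        .config("spark.executor.extraJavaOptions", executor_jmx_opts)
        .getOrCreate()
    )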

Sqoop on Hadoop: NoSuchMethodError: com.google.common.base.Stopwatch.createStarted() [duplicate]

风格不统一 Submitted on 2020-04-11 08:03:10
Question: This question already has an answer here: How to resolve Guava dependency issue while submitting Uber Jar to Google Dataproc (1 answer). Closed 3 months ago. I'm running Sqoop on Hadoop on Google Cloud Dataproc to access PostgreSQL via the Cloud SQL Proxy, but I'm getting a Java dependency error:

    INFO: First Cloud SQL connection, generating RSA key pair.
    Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun
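The stack trace is cut off; the linked duplicate points at a Guava version conflict (Stopwatch.createStarted() appeared around Guava 15, and older Guava jars bundled with Hadoop or Sqoop can shadow the newer one). A hedged diagnostic sketch for spotting which Guava jars sit on a Dataproc node; the paths are typical locations and are assumptions:

    # List bundled Guava jars so the conflicting (older) version is visible.
    import glob

    patterns = [
        "/usr/lib/hadoop/lib/guava-*.jar",
        "/usr/lib/hadoop-mapreduce/lib/guava-*.jar",
        "/usr/lib/sqoop/lib/guava-*.jar",
    ]
    for pattern in patterns:
        for jar in sorted(glob.glob(pattern)):
            print(jar)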

When submitting a job with pyspark, how to access static files uploaded with the --files argument?

跟風遠走 Submitted on 2020-03-11 02:47:45
Question: For example, I have a folder:

    /
    - test.py
    - test.yml

and the job is submitted to the Spark cluster with:

    gcloud beta dataproc jobs submit pyspark --files=test.yml "test.py"

In test.py, I want to access the static file I uploaded:

    with open('test.yml') as test_file:
        logging.info(test_file.read())

but got the following exception:

    IOError: [Errno 2] No such file or directory: 'test.yml'

How can I access the file I uploaded? Answer 1: Files distributed using SparkContext.addFile (and --files) can be
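The answer is cut off, but it points at files distributed via addFile/--files. A hedged sketch of reading the distributed copy through pyspark.SparkFiles instead of the driver's original working directory:

    # Hedged sketch: resolve the local path of a file shipped with --files.
    from pyspark import SparkFiles
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-yml-example").getOrCreate()

    # SparkFiles.get returns the path the file was distributed to on this node.
    with open(SparkFiles.get("test.yml")) as test_file:
        print(test_file.read())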

Workflow scheduling on GCP Dataproc cluster

落爺英雄遲暮 Submitted on 2020-02-24 03:56:08
Question: I have some complex Oozie workflows to migrate from on-prem Hadoop to GCP Dataproc. The workflows consist of shell scripts, Python scripts, Spark-Scala jobs, Sqoop jobs, etc. I have come across some potential solutions covering my workflow scheduling needs: Cloud Composer; a Dataproc Workflow Template with Cloud Scheduler; installing Oozie on a Dataproc auto-scaling cluster. Please let me know which option would be most efficient in terms of performance, cost, and migration complexity. Answer 1: All 3
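The answer breaks off after "All 3". As an illustration of the Cloud Composer option only, a hedged sketch of an Airflow DAG chaining a shell step and a PySpark step on an existing Dataproc cluster; the operator modules follow the Airflow 1.10-era layout Composer used around that time, and the cluster name, region, and GCS path are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

    with DAG(dag_id="migrated_oozie_workflow",
             start_date=datetime(2020, 1, 1),
             schedule_interval="@daily",
             catchup=False) as dag:

        prepare = BashOperator(
            task_id="prepare_inputs",
            bash_command="echo 'shell step migrated from an Oozie action'",
        )

        pyspark_step = DataProcPySparkOperator(
            task_id="pyspark_step",
            main="gs://my-bucket/jobs/script.py",  # placeholder GCS path
            cluster_name="my-cluster",             # placeholder cluster
            region="us-central1",
        )

        prepare >> pyspark_step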
