google-cloud-dataproc

How can we visualize the Dataproc job status in Google Cloud Platform?

随声附和 submitted on 2021-02-20 04:17:05
Question: How can we visualize (via dashboards) the Dataproc job status in Google Cloud Platform? We want to check whether jobs are running or not, in addition to their status such as running, delayed, or blocked. On top of that we want to set up alerting (Stackdriver Alerting) as well.

Answer 1: This page lists all the metrics available in Stackdriver: https://cloud.google.com/monitoring/api/metrics_gcp#gcp-dataproc You could use cluster/job/submitted_count, cluster/job/failed_count and cluster/job/running_count to
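As an illustration only (not part of the original answer), here is a minimal Python sketch that reads the running-job metric named above through the Cloud Monitoring API; "your-project-id" is a placeholder and the one-hour window is arbitrary:

import time
from google.cloud import monitoring_v3

project_id = "your-project-id"  # placeholder
client = monitoring_v3.MetricServiceClient()

now = time.time()
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": int(now)},
        "start_time": {"seconds": int(now) - 3600},  # last hour
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "dataproc.googleapis.com/cluster/job/running_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Print one value per cluster and data point; a dashboard chart or an
# alerting policy can be built on the same metric filter.
for series in results:
    for point in series.points:
        print(series.resource.labels.get("cluster_name"), point.value.int64_value)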

How to create a Dataproc cluster with a service account

半城伤御伤魂 submitted on 2021-02-11 13:38:43
Question: I am quite confused by this document enter link description here

Service account requirements and limitations:
* Service accounts can only be set when a cluster is created.
* You need to create a service account before creating the Cloud Dataproc cluster that will be associated with the service account.
* Once set, the service account used for a cluster cannot be changed.

Does this mean I cannot create a service account which has the role to create a Dataproc cluster? For now, I can only
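For illustration only (not from the original thread): a minimal sketch with the google-cloud-dataproc Python client that sets the service account at creation time, which is what the quoted limitations describe; the project, region, cluster name and service account email are placeholders.

from google.cloud import dataproc_v1

project_id = "your-project-id"  # placeholder
region = "europe-west1"         # placeholder

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "example-cluster",
    "config": {
        # The service account must already exist (and hold roles such as
        # roles/dataproc.worker) before the cluster is created; it cannot
        # be changed afterwards.
        "gce_cluster_config": {
            "service_account": "my-dataproc-sa@your-project-id.iam.gserviceaccount.com"
        },
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)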

Why does Spark (on Google Dataproc) not use all vcores?

杀马特。学长 韩版系。学妹 submitted on 2021-02-07 12:31:55
Question: I'm running a Spark job on a Google Dataproc cluster, but it looks like Spark is not using all the vcores available in the cluster, as you can see below. Based on some other questions like this and this, I have set up the cluster to use DominantResourceCalculator so that both vcpus and memory are considered for resource allocation:

gcloud dataproc clusters create cluster_name --bucket="profiling-job-default" \
  --zone=europe-west1-c \
  --master-boot-disk-size=500GB \
  --worker-boot-disk-size=500GB \
  --master
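As a hedged illustration of the usual follow-up step (explicitly sizing executors so YARN can actually hand out all vcores once DominantResourceCalculator is enabled), a short PySpark sketch; the core, memory and instance numbers are placeholders to be tuned to the machine type:

from pyspark.sql import SparkSession

# Placeholder sizing: with DominantResourceCalculator, YARN allocates
# containers by cores *and* memory, so the executor settings should be
# chosen to fill each NodeManager.
spark = (
    SparkSession.builder
    .appName("vcore-check")
    .config("spark.executor.cores", "4")       # placeholder
    .config("spark.executor.memory", "12g")    # placeholder
    .config("spark.executor.instances", "8")   # placeholder
    .getOrCreate()
)

print("default parallelism:", spark.sparkContext.defaultParallelism)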

BigQuery connector ClassNotFoundException in PySpark on Dataproc

微笑、不失礼 submitted on 2021-01-28 20:07:28
Question: I'm trying to run a script in PySpark, using Dataproc. The script is kind of a merge between this example and what I need to do, as I wanted to check if everything works. Obviously, it doesn't. The error I get is:

File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: com.google.cloud.hadoop.io
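For reference only, and under the assumption that the Hadoop BigQuery connector jar has been added to the job (the ClassNotFoundException above typically means it has not), this sketch shows how the RDD read from the tutorial is usually wired up; the conf keys and class names follow the Dataproc BigQuery-connector example, and the project/bucket/table values are placeholders:

# Assumes the BigQuery connector jar was submitted with the job, e.g.
#   gcloud dataproc jobs submit pyspark --jars gs://<path-to-bigquery-connector-jar> ...
# (the exact jar location depends on your image version).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read").getOrCreate()
sc = spark.sparkContext

conf = {
    "mapred.bq.project.id": "your-project-id",      # placeholder
    "mapred.bq.gcs.bucket": "your-temp-bucket",     # placeholder
    "mapred.bq.temp.gcs.path": "gs://your-temp-bucket/tmp/bq",
    "mapred.bq.input.project.id": "bigquery-public-data",
    "mapred.bq.input.dataset.id": "samples",
    "mapred.bq.input.table.id": "shakespeare",
}

table_rdd = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf,
)
print(table_rdd.take(1))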

Error connecting to BigQuery from Dataproc with Datalab using BigQuery Spark connector (Error getting access token from metadata server at)

本小妞迷上赌 submitted on 2021-01-28 12:10:25
Question: I have a BigQuery table and a Dataproc cluster (with Datalab), and I follow this guide: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example

bucket = spark._jsc.hadoopConfiguration().get("fs.gs.system.bucket")
project = spark._jsc.hadoopConfiguration().get("fs.gs.project.id")
# Set an input directory for reading data from BigQuery.
todays_date = datetime.strftime(datetime.today(), "%Y-%m-%d-%H-%M-%S")
input_directory = "gs://{}/tmp/bigquery-{}".format(bucket, todays_date)
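Purely as a diagnostic sketch (not from the thread): before blaming the connector, you can check from the Datalab notebook whether application-default credentials and the metadata server are reachable at all, since the "Error getting access token from metadata server" message points at the VM's credentials rather than at Spark. This assumes the google-auth library is available on the cluster.

import google.auth
from google.auth.transport.requests import Request

# If the VM's service account or its access scopes are missing, this
# refresh fails in roughly the same way as the error in the question.
credentials, project = google.auth.default()
credentials.refresh(Request())
print("project:", project)
print("token acquired:", credentials.token is not None)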

Spark HBase/BigTable - Wide/sparse dataframe persistence

不羁的心 submitted on 2021-01-28 08:03:36
Question: I want to persist to BigTable a very wide Spark DataFrame (>100,000 columns) that is sparsely populated (>99% of values are null), while keeping only non-null values (to avoid storage cost). Is there a way to tell Spark to ignore nulls when writing? Thanks!

Source: https://stackoverflow.com/questions/65647574/spark-hbase-bigtable-wide-sparse-dataframe-persistence
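One hedged way to sketch the idea (this is not an answer from the thread): melt the wide frame into (rowkey, column, value) triples and drop nulls before writing, so only populated cells reach BigTable/HBase. The column names and values below are illustrative only.

from itertools import chain
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sparse-to-long").getOrCreate()

# Illustrative wide, sparsely populated frame.
df = spark.createDataFrame(
    [("row1", 1, None, None), ("row2", None, None, 7)],
    ["rowkey", "c1", "c2", "c3"],
)

value_cols = [c for c in df.columns if c != "rowkey"]
kv = F.create_map(*chain.from_iterable((F.lit(c), F.col(c)) for c in value_cols))

# Keep only non-null cells: one output row per populated (rowkey, column).
long_df = (
    df.select("rowkey", F.explode(kv).alias("column", "value"))
      .where(F.col("value").isNotNull())
)
long_df.show()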

Execute bash script on a dataproc cluster from a composer

Deadly submitted on 2021-01-27 22:55:18
Question: I wanted to add jars to a Dataproc cluster in a specific location once the cluster has been created, using a simple shell script. I would like to automate this step to run from Composer once the Dataproc cluster has been created; the next step is to execute the bash script which would add the jars to the Dataproc cluster. Can you suggest which Airflow operator to use to execute bash scripts on the Dataproc cluster?

Answer 1: For running a simple shell script on the master node, the easiest way would
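As one hedged possibility (this assumes the Airflow Google provider's DataprocSubmitJobOperator and the Pig "sh" trick, and may not be the exact operator the truncated answer recommends): shell commands can be run on the master node by submitting a Pig job from Composer. Project, region, cluster name and the gsutil path are placeholders.

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PIG_JOB = {
    "reference": {"project_id": "your-project-id"},
    "placement": {"cluster_name": "your-cluster-name"},
    "pig_job": {
        "query_list": {
            "queries": [
                # Pig's 'sh' command executes on the master node.
                "sh gsutil cp gs://your-bucket/jars/*.jar /usr/lib/spark/jars/",
            ]
        }
    },
}

with DAG("run_script_on_dataproc", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    copy_jars = DataprocSubmitJobOperator(
        task_id="copy_jars_to_master",
        project_id="your-project-id",
        region="your-region",
        job=PIG_JOB,
    )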

How to set optional properties on Google Dataproc Cluster using Java API?

二次信任 submitted on 2021-01-27 19:57:15
Question: I am trying to create a Dataproc cluster using the Java API, following this documentation: https://cloud.google.com/dataproc/docs/quickstarts/quickstart-lib Sample code is as below:

public static void createCluster() throws IOException, InterruptedException {
  // TODO(developer): Replace these variables before running the sample.
  String projectId = "your-project-id";
  String region = "your-project-region";
  String clusterName = "your-cluster-name";
  createCluster(projectId, region, clusterName);
}

public
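Not the Java answer itself, but as a sketch of where the optional settings live in the cluster proto (the Java ClusterConfig/SoftwareConfig builders expose the equivalent setters), here is the same call via the Python client; project, region, cluster name and the example property are placeholders:

from google.cloud import dataproc_v1

project_id = "your-project-id"   # placeholder
region = "your-region"           # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "your-cluster-name",
    "config": {
        "software_config": {
            # Cluster properties use a "prefix:key" form, e.g. spark:, yarn:.
            "properties": {"spark:spark.executor.memory": "4g"},
            # Optional components, if the cluster needs them.
            "optional_components": [dataproc_v1.Component.JUPYTER],
        }
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)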

GCP Dataproc custom image Python environment

你离开我真会死。 submitted on 2021-01-27 05:40:23
Question: I have an issue with a Dataproc custom image and PySpark. My custom image is based on Dataproc 1.4.1-debian9, and with my initialisation script I install python3 and some packages from a requirements.txt file, then set the python3 env variable to force pyspark to use python3. But when I submit a job on a cluster created with this image (with the single node flag for simplicity), the job can't find the packages installed. If I log on to the cluster machine and run the pyspark command, starts
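As a hedged diagnostic sketch (not from the thread): a tiny PySpark job that prints which Python interpreter the driver and the executors actually use, which usually shows whether the python3 environment variable took effect for submitted jobs as well as for the interactive pyspark shell:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("python-env-check").getOrCreate()
sc = spark.sparkContext

print("driver python:", sys.executable)
# Each executor reports its own interpreter; if this is not the custom
# image's python3, submitted jobs will not see the pip-installed packages.
print("executor python:",
      sc.parallelize(range(2), 2)
        .map(lambda _: __import__("sys").executable)
        .distinct()
        .collect())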

Can't create a Python 3 notebook in Jupyter Notebook

亡梦爱人 submitted on 2020-12-10 07:01:45
Question: I'm following this tutorial and I'm stuck when I want to create a new Jupyter notebook (Python 3). The cluster is created using this command:

gcloud beta dataproc clusters create ${CLUSTER_NAME} \
  --region=${REGION} \
  --image-version=1.4 \
  --master-machine-type=n1-standard-4 \
  --worker-machine-type=n1-standard-4 \
  --bucket=${BUCKET_NAME} \
  --optional-components=ANACONDA,JUPYTER \
  --enable-component-gateway

When I access JupyterLab and try to create a new notebook I can see: and then