Dataproc + BigQuery examples - any available?

小鲜肉 2020-12-10 05:12

According to the Dataproc docs, it has "native and automatic integrations with BigQuery".

I have a table in BigQuery. I want to read that table and perform some analysis on it using Spark on a Dataproc cluster. Are there any examples available?

2 Answers
  • 2020-12-10 05:31

    The example in the other answer doesn't show how to write data back to an output table. You need to do this:

    .saveAsNewAPIHadoopFile(
        hadoopConf.get(BigQueryConfiguration.TEMP_GCS_PATH_KEY),
        classOf[String],
        classOf[JsonObject],
        classOf[BigQueryOutputFormat[String, JsonObject]],
        hadoopConf)
    

    where the String key is actually ignored by the output format; only the JsonObject value is written to BigQuery.
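
    To make that concrete, here is a minimal end-to-end sketch of the write path. The wordCounts RDD of (word, count) pairs is an assumption for illustration (for instance, the result of the word-count step in the other answer), hadoopConf is the Hadoop configuration set up there, and the import path of BigQueryOutputFormat may vary by connector version:

    import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
    import com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat
    import com.google.gson.JsonObject

    // Convert each (word, count) pair into a BigQuery JSON row. The field
    // names must match the output table schema ("Word" STRING, "Count" INTEGER).
    val jsonRows = wordCounts.map { case (word, count) =>
      val row = new JsonObject()
      row.addProperty("Word", word)
      row.addProperty("Count", count)
      (word, row)  // the String key is ignored on write
    }

    // Stage rows to the temporary GCS path; the output format then loads
    // them into the BigQuery table configured via configureBigQueryOutput.
    jsonRows.saveAsNewAPIHadoopFile(
        hadoopConf.get(BigQueryConfiguration.TEMP_GCS_PATH_KEY),
        classOf[String],
        classOf[JsonObject],
        classOf[BigQueryOutputFormat[String, JsonObject]],
        hadoopConf)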

  • 2020-12-10 05:54

    To begin, as noted in this question, the BigQuery connector is preinstalled on Cloud Dataproc clusters.

    Here is an example of how to read data from BigQuery into Spark, in this case to perform a word count. The data is read using SparkContext.newAPIHadoopRDD; see the Spark documentation for more information on that method.

    import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
    import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
    import com.google.gson.JsonObject
    import org.apache.hadoop.io.LongWritable
    
    val projectId = "<your-project-id>"
    val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
    val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
    val outputTableSchema =
        "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
    val jobName = "wordcount"
    
    val conf = sc.hadoopConfiguration
    
    // Set the job-level projectId.
    conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)
    
    // Use the systemBucket for temporary BigQuery export data used by the InputFormat.
    val systemBucket = conf.get("fs.gs.system.bucket")
    conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)
    
    // Configure input and output for BigQuery access.
    BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
    BigQueryConfiguration.configureBigQueryOutput(conf,
        fullyQualifiedOutputTableId, outputTableSchema)
    
    // The JSON field to count words from.
    val fieldName = "word"
    
    // Load the table as an RDD of (row offset, JSON row) pairs.
    val tableData = sc.newAPIHadoopRDD(conf,
        classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
    tableData.cache()
    tableData.count()
    
    // Peek at the first ten rows.
    tableData.map(entry => (entry._1.toString(), entry._2.toString())).take(10)
    

    You will need to customize this example with your settings, including your Cloud Platform project ID in <your-project-id> and your output table ID in <your-fully-qualified-table-id>.
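
    The example above stops at inspecting rows. As a rough sketch of the word-count step itself (the word and word_count fields match the public shakespeare sample; the toTuple helper is hypothetical):

    // Hypothetical helper: extract (word, count) from one JSON row of
    // publicdata:samples.shakespeare.
    def toTuple(row: JsonObject): (String, Long) =
      (row.get(fieldName).getAsString.toLowerCase, row.get("word_count").getAsLong)

    // Sum the per-document counts for each word across the whole table.
    val wordCounts = tableData
        .map(entry => toTuple(entry._2))
        .reduceByKey(_ + _)

    wordCounts.take(10).foreach(println)

    Writing wordCounts back to the configured output table is covered in the other answer.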

    Finally, if you end up using the BigQuery connector with MapReduce, this page has examples of how to write MapReduce jobs with the BigQuery connector.
