apache-spark

Spark writing Parquet array<string> converts to a different datatype when loading into BigQuery

拥有回忆 submitted on 2021-01-29 07:37:03
Question: Spark DataFrame schema:

StructType([
    StructField("a", StringType(), False),
    StructField("b", StringType(), True),
    StructField("c", BinaryType(), False),
    StructField("d", ArrayType(StringType(), False), True),
    StructField("e", TimestampType(), True)
])

When I write the DataFrame to Parquet and load it into BigQuery, the schema is interpreted differently. It is a simple load from JSON and a write to Parquet using a Spark DataFrame. BigQuery schema: [ { "type": "STRING", "name": "a", "mode":
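
A minimal PySpark sketch of the write side, assuming a local SparkSession and a hypothetical output path; it only materialises the schema from the question (including the array<string> column "d") and writes it to Parquet so the resulting files can be inspected or loaded into BigQuery:

import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               BinaryType, ArrayType, TimestampType)

spark = SparkSession.builder.appName("parquet-array-demo").getOrCreate()

schema = StructType([
    StructField("a", StringType(), False),
    StructField("b", StringType(), True),
    StructField("c", BinaryType(), False),
    StructField("d", ArrayType(StringType(), False), True),
    StructField("e", TimestampType(), True),
])

rows = [("id-1", "x", bytearray(b"\x00\x01"), ["p", "q"], datetime.datetime.now())]
df = spark.createDataFrame(rows, schema)

# "d" is written with Parquet's LIST encoding; how BigQuery maps that group
# can depend on writer settings such as spark.sql.parquet.writeLegacyFormat.
df.write.mode("overwrite").parquet("/tmp/parquet_array_demo")  # hypothetical path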

Spark program difference in local mode and cluster

一曲冷凌霜 submitted on 2021-01-29 07:30:36
Question: If I write a Spark program and run it in standalone mode, do I have to change my program code when I want to deploy it on a cluster, or is no code change needed? Is Spark programming independent of the cluster size? Answer 1: I don't think you need to make any changes. Your program should run the same way as it does in local mode. Yes, Spark programs are independent of the cluster, unless you are using something specific to the cluster; normally this is managed by YARN. Answer 2: You just
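
A minimal sketch of keeping the code cluster-agnostic, assuming the master is supplied at submit time rather than hardcoded (the app name and paths below are hypothetical):

from pyspark.sql import SparkSession

# No .master(...) call here: local vs. cluster is decided by spark-submit, e.g.
#   spark-submit --master local[*] app.py
#   spark-submit --master yarn --deploy-mode cluster app.py
spark = SparkSession.builder.appName("portable-app").getOrCreate()

df = spark.read.json("/data/input.json")  # hypothetical input path
df.groupBy("a").count().write.mode("overwrite").parquet("/data/output")  # hypothetical output path

spark.stop()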

Databricks dbutils throwing NullPointerException

余生长醉 submitted on 2021-01-29 07:22:02
Question: I am trying to read a secret from Azure Key Vault using Databricks dbutils, but I am facing the following exception:

OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds
Exception in thread "main" java.lang.NullPointerException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect
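
For reference, a sketch of how a secret is usually read with dbutils inside a Databricks Python notebook; the scope and key names here are hypothetical, and the scope must already be backed by the Key Vault:

# Runs inside a Databricks notebook, where `dbutils` is predefined.
# "my-keyvault-scope" and "db-password" are hypothetical names.
password = dbutils.secrets.get(scope="my-keyvault-scope", key="db-password")

# List the scopes and keys to verify the scope is visible to this cluster.
print(dbutils.secrets.listScopes())
print(dbutils.secrets.list("my-keyvault-scope"))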

How to recover from checkpoint when using python spark direct approach?

一笑奈何 submitted on 2021-01-29 07:19:36
Question: After reading the official docs, I tried using checkpointing with getOrCreate in Spark Streaming. Some snippets:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def get_ssc():
    sc = SparkContext("yarn-client")
    ssc = StreamingContext(sc, 10)  # calc every 10s
    ks = KafkaUtils.createDirectStream(
        ssc, ['lucky-track'], {"metadata.broker.list": KAFKA_BROKER})
    process_data(ks)
    ssc.checkpoint(CHECKPOINT_DIR)
    return ssc

if __name__ == '__main__':
    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, get_ssc)
    ssc.start()
    ssc.awaitTermination()

The code works fine

Reading property file from external path in spark scala throwing error

核能气质少年 submitted on 2021-01-29 07:06:30
Question: I am trying to read a property file from an external path in Spark Scala like this:

spark-submit --class com.spark.scala.my.class --deploy-mode cluster --master yarn --files /user/mine/dev.properties /path/to/jar/dev-0.0.1-SNAPSHOT-uber.jar 2020-08-19T06:00:00Z 2020-08-20T07:00:00Z

and I am reading it like this:

val props = new Properties()
val filePath = SparkFiles.get("/user/mine/dev.properties")
LOGGER.info("Path to file : " + filePath)
val is = Source.fromFile(filePath)
props.load(is
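
With --files, files are usually looked up through SparkFiles.get by their bare file name rather than by the original submission path. A PySpark-flavoured sketch of that pattern, assuming the same dev.properties file was shipped with --files:

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("props-demo").getOrCreate()

# The file was shipped with `--files /user/mine/dev.properties`, so it is
# resolved by its bare name, not by the path used at submit time.
local_path = SparkFiles.get("dev.properties")

with open(local_path) as fh:
    for line in fh:
        print(line.rstrip())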

Pyspark: dynamically generate condition for when() clause during runtime

孤人 submitted on 2021-01-29 06:37:29
Question: I have read a CSV file into a PySpark DataFrame. Applying conditions in a when() clause works fine when the conditions are known before runtime.

import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions
from pyspark.sql.functions import col

sc = SparkContext('local', 'example')
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv')  # assuming the file contains a header
# Sample content of csv file
# col1,value
# 1,aa
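
One common way to build a when() condition at runtime is to fold a list of predicates into a single Column with functools.reduce; a sketch, with the column names and rules invented for illustration:

import functools
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dynamic-when").getOrCreate()
df = spark.createDataFrame([(1, "aa"), (2, "bb"), (3, "cc")], ["col1", "value"])

# Rules that might only be known at runtime (hypothetical).
rules = [("value", "aa"), ("value", "cc")]

# OR the individual predicates together into one Column expression.
condition = functools.reduce(
    lambda acc, rule: acc | (F.col(rule[0]) == rule[1]),
    rules[1:],
    F.col(rules[0][0]) == rules[0][1],
)

df.withColumn("flag", F.when(condition, "matched").otherwise("no_match")).show()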

How to read a CSV file with multiple delimiters in Spark

本小妞迷上赌 submitted on 2021-01-29 06:00:07
Question: I am trying to read a CSV file using Spark 1.6:

s.no|Name$id|designation|salry
1   |abc$12 |xxx        |yyy

val df = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", "|")
  .load("path")

If I add "$" as a delimiter, it throws an error saying only one delimiter is permitted. Answer 1: You can apply the operation once the DataFrame has been created by reading from the source with the primary delimiter (I am referring to "|" as the primary delimiter for better understanding). You can do something like the following; sc is the SparkSession
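
The answer's code is cut off above; a PySpark sketch of the split-after-read approach it describes, with the secondary split on "$" chosen to match the sample file:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("multi-delimiter").getOrCreate()

# In practice the frame would come from the CSV read with the primary
# delimiter only, e.g.
#   spark.read.option("header", "true").option("delimiter", "|").csv("path")
# Inline data is used here so the sketch is self-contained.
df = spark.createDataFrame(
    [("1", "abc$12", "xxx", "yyy")],
    ["s.no", "Name$id", "designation", "salry"],
)

# Split the combined "Name$id" column on the secondary delimiter "$"
# (escaped, because split() takes a regex), then drop the combined column.
split_col = F.split(df["Name$id"], r"\$")
df2 = (df
       .withColumn("Name", split_col.getItem(0))
       .withColumn("id", split_col.getItem(1))
       .drop("Name$id"))

df2.show()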

Scala spark: Create List of Dataset from a Dataset map operation

旧巷老猫 submitted on 2021-01-29 05:42:35
Question: Suppose I want to create two types of metric, metricA and metricB, after transforming another dataset. If a certain condition is met, both metricA and metricB are generated; if the condition is not met, only metricA is generated. The idea is to write the two metrics to two different paths (pathA, pathB). The approach I took was to create a Dataset of GeneralMetric and then, based on what is inside, write to different paths, but it didn't work, as pattern matching inside a Dataset wouldn't work: val s:
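
The question is Scala-specific, but the underlying idea of routing rows to two output paths by a condition can be sketched at the DataFrame level; here in PySpark, with hypothetical column names and paths:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("two-metrics").getOrCreate()

df = spark.createDataFrame(
    [("m1", 10, True), ("m2", 20, False)],
    ["name", "value", "emit_b"],  # hypothetical columns
)

# Every row contributes to metric A; only rows meeting the condition also
# contribute to metric B. Each subset is written to its own path.
df.write.mode("overwrite").parquet("/tmp/pathA")                          # hypothetical path
df.filter(F.col("emit_b")).write.mode("overwrite").parquet("/tmp/pathB")  # hypothetical path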

Dividing dataframes in pyspark

不想你离开。 submitted on 2021-01-29 05:33:59
Question: Following up on this question and these dataframes, I am trying to convert this into this (I know it looks the same, but refer to the next code line to see the difference). In pandas I used the line teste_2 = (value/value.groupby(level=0).sum()), and in PySpark I tried several solutions. The first one was:

df_2 = (df/df.groupby(["age"]).sum())

However, I am getting the following error:

TypeError: unsupported operand type(s) for /: 'DataFrame' and 'DataFrame'

The second one was: df_2 = (df.filter
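
Spark DataFrames do not support arithmetic between whole frames, so a common workaround is to compute the group sums, join them back, and divide column-wise. A sketch, where "age" is taken from the question and the "value" column name is assumed:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("divide-by-group-sum").getOrCreate()

# Hypothetical data: "age" is the grouping key, "value" the column to normalise.
df = spark.createDataFrame([(10, 2.0), (10, 6.0), (20, 5.0)], ["age", "value"])

sums = df.groupBy("age").agg(F.sum("value").alias("group_sum"))

result = (df.join(sums, on="age")
            .withColumn("share", F.col("value") / F.col("group_sum"))
            .drop("group_sum"))

result.show()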