apache-spark

Spark writing Parquet array<string> converts to a different datatype when loading into BigQuery

拥有回忆 submitted on 2021-01-29 07:37:03
Question: Spark DataFrame schema:

StructType([
    StructField("a", StringType(), False),
    StructField("b", StringType(), True),
    StructField("c", BinaryType(), False),
    StructField("d", ArrayType(StringType(), False), True),
    StructField("e", TimestampType(), True)
])

When I write the DataFrame to Parquet and load it into BigQuery, the schema is interpreted differently. It is a simple load from JSON and a write to Parquet using a Spark DataFrame. BigQuery schema: [ { "type": "STRING", "name": "a", "mode":
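
A minimal PySpark sketch of the write side, assuming a local SparkSession and a hypothetical output path; it only materialises the schema from the question (including the array<string> column "d") and writes it to Parquet so the resulting files can be inspected or loaded into BigQuery:

import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               BinaryType, ArrayType, TimestampType)

spark = SparkSession.builder.appName("parquet-array-demo").getOrCreate()

schema = StructType([
    StructField("a", StringType(), False),
    StructField("b", StringType(), True),
    StructField("c", BinaryType(), False),
    StructField("d", ArrayType(StringType(), False), True),
    StructField("e", TimestampType(), True),
])

rows = [("id-1", "x", bytearray(b"\x00\x01"), ["p", "q"], datetime.datetime.now())]
df = spark.createDataFrame(rows, schema)

# "d" is written with Parquet's LIST encoding; how BigQuery maps that group
# can depend on writer settings such as spark.sql.parquet.writeLegacyFormat.
df.write.mode("overwrite").parquet("/tmp/parquet_array_demo")  # hypothetical path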

Spark program difference in local mode and cluster

一曲冷凌霜 submitted on 2021-01-29 07:30:36
Question: If I write a Spark program and run it in standalone mode, do I have to change my program code when I want to deploy it on a cluster, or is no code change needed? Is Spark programming independent of the cluster size? Answer 1: I don't think you need to make any changes. Your program should run the same way as it does in local mode. Yes, Spark programs are independent of the cluster, unless you are using something specific to the cluster; normally this is managed by YARN. Answer 2: You just
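
A minimal sketch of keeping the code cluster-agnostic, assuming the master is supplied at submit time rather than hardcoded (the app name and paths below are hypothetical):

from pyspark.sql import SparkSession

# No .master(...) call here: local vs. cluster is decided by spark-submit, e.g.
#   spark-submit --master local[*] app.py
#   spark-submit --master yarn --deploy-mode cluster app.py
spark = SparkSession.builder.appName("portable-app").getOrCreate()

df = spark.read.json("/data/input.json")  # hypothetical input path
df.groupBy("a").count().write.mode("overwrite").parquet("/data/output")  # hypothetical output path

spark.stop()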

Databricks dbutils throwing NullPointerException

余生长醉 submitted on 2021-01-29 07:22:02
Question: I am trying to read a secret from Azure Key Vault using Databricks dbutils, but I am facing the following exception:

OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds
Exception in thread "main" java.lang.NullPointerException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect
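
For reference, a sketch of how a secret is usually read with dbutils inside a Databricks Python notebook; the scope and key names here are hypothetical, and the scope must already be backed by the Key Vault:

# Runs inside a Databricks notebook, where `dbutils` is predefined.
# "my-keyvault-scope" and "db-password" are hypothetical names.
password = dbutils.secrets.get(scope="my-keyvault-scope", key="db-password")

# List the scopes and keys to verify the scope is visible to this cluster.
print(dbutils.secrets.listScopes())
print(dbutils.secrets.list("my-keyvault-scope"))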

How to recover from checkpoint when using python spark direct approach?

一笑奈何 submitted on 2021-01-29 07:19:36
Question: After reading the official docs, I tried using checkpointing with getOrCreate in Spark Streaming. Some snippets:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def get_ssc():
    sc = SparkContext("yarn-client")
    ssc = StreamingContext(sc, 10)  # calc every 10s
    ks = KafkaUtils.createDirectStream(
        ssc, ['lucky-track'], {"metadata.broker.list": KAFKA_BROKER})
    process_data(ks)
    ssc.checkpoint(CHECKPOINT_DIR)
    return ssc

if __name__ == '__main__':
    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, get_ssc)
    ssc.start()
    ssc.awaitTermination()

The code works fine

Reading property file from external path in spark scala throwing error

核能气质少年 submitted on 2021-01-29 07:06:30
Question: I am trying to read a property file from an external path in Spark Scala like this:

spark-submit --class com.spark.scala.my.class --deploy-mode cluster --master yarn --files /user/mine/dev.properties /path/to/jar/dev-0.0.1-SNAPSHOT-uber.jar 2020-08-19T06:00:00Z 2020-08-20T07:00:00Z

and I am reading it like this:

val props = new Properties()
val filePath = SparkFiles.get("/user/mine/dev.properties")
LOGGER.info("Path to file : " + filePath)
val is = Source.fromFile(filePath)
props.load(is
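
With --files, files are usually looked up through SparkFiles.get by their bare file name rather than by the original submission path. A PySpark-flavoured sketch of that pattern, assuming the same dev.properties file was shipped with --files:

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("props-demo").getOrCreate()

# The file was shipped with `--files /user/mine/dev.properties`, so it is
# resolved by its bare name, not by the path used at submit time.
local_path = SparkFiles.get("dev.properties")

with open(local_path) as fh:
    for line in fh:
        print(line.rstrip())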

Pyspark: dynamically generate condition for when() clause during runtime

孤人 submitted on 2021-01-29 06:37:29
Question: I have read a CSV file into a PySpark DataFrame. Applying conditions in a when() clause works fine when the conditions are known before runtime.

import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions
from pyspark.sql.functions import col

sc = SparkContext('local', 'example')
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv')  # assuming the file contains a header
# Sample content of csv file
# col1,value
# 1,aa
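
One common way to build a when() condition at runtime is to fold a list of predicates into a single Column with functools.reduce; a sketch, with the column names and rules invented for illustration:

import functools
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dynamic-when").getOrCreate()
df = spark.createDataFrame([(1, "aa"), (2, "bb"), (3, "cc")], ["col1", "value"])

# Rules that might only be known at runtime (hypothetical).
rules = [("value", "aa"), ("value", "cc")]

# OR the individual predicates together into one Column expression.
condition = functools.reduce(
    lambda acc, rule: acc | (F.col(rule[0]) == rule[1]),
    rules[1:],
    F.col(rules[0][0]) == rules[0][1],
)

df.withColumn("flag", F.when(condition, "matched").otherwise("no_match")).show()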

How to read a CSV file with multiple delimiters in Spark

本小妞迷上赌 submitted on 2021-01-29 06:00:07
Question: I am trying to read a CSV file using Spark 1.6:

s.no|Name$id|designation|salry
1   |abc$12 |xxx        |yyy

val df = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", "|")
  .load("path")

If I add "$" as a delimiter, it throws an error saying only one delimiter is permitted. Answer 1: You can apply the operation once the DataFrame has been created by reading from the source with the primary delimiter (I am referring to "|" as the primary delimiter for better understanding). You can do something like the following; sc is the SparkSession
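
The answer's code is cut off above; a PySpark sketch of the split-after-read approach it describes, with the secondary split on "$" chosen to match the sample file:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("multi-delimiter").getOrCreate()

# In practice the frame would come from the CSV read with the primary
# delimiter only, e.g.
#   spark.read.option("header", "true").option("delimiter", "|").csv("path")
# Inline data is used here so the sketch is self-contained.
df = spark.createDataFrame(
    [("1", "abc$12", "xxx", "yyy")],
    ["s.no", "Name$id", "designation", "salry"],
)

# Split the combined "Name$id" column on the secondary delimiter "$"
# (escaped, because split() takes a regex), then drop the combined column.
split_col = F.split(df["Name$id"], r"\$")
df2 = (df
       .withColumn("Name", split_col.getItem(0))
       .withColumn("id", split_col.getItem(1))
       .drop("Name$id"))

df2.show()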

Scala spark: Create List of Dataset from a Dataset map operation

旧巷老猫 submitted on 2021-01-29 05:42:35
Question: Suppose I want to create two types of metric, metricA and metricB, after transforming another dataset. If a certain condition is met, both metricA and metricB are generated; if the condition is not met, only metricA is generated. The idea is to write the two metrics to two different paths (pathA, pathB). The approach I took was to create a Dataset of GeneralMetric and then, based on what is inside, write to different paths, but it didn't work, as pattern matching inside a Dataset wouldn't work: val s:
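
The question is Scala-specific, but the underlying idea of routing rows to two output paths by a condition can be sketched at the DataFrame level; here in PySpark, with hypothetical column names and paths:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("two-metrics").getOrCreate()

df = spark.createDataFrame(
    [("m1", 10, True), ("m2", 20, False)],
    ["name", "value", "emit_b"],  # hypothetical columns
)

# Every row contributes to metric A; only rows meeting the condition also
# contribute to metric B. Each subset is written to its own path.
df.write.mode("overwrite").parquet("/tmp/pathA")                          # hypothetical path
df.filter(F.col("emit_b")).write.mode("overwrite").parquet("/tmp/pathB")  # hypothetical path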

Dividing dataframes in pyspark

不想你离开。 submitted on 2021-01-29 05:33:59
Question: Following up on this question and these dataframes, I am trying to convert this into this (I know it looks the same, but refer to the next code line to see the difference). In pandas I used the line teste_2 = (value/value.groupby(level=0).sum()), and in PySpark I tried several solutions. The first one was:

df_2 = (df/df.groupby(["age"]).sum())

However, I am getting the following error:

TypeError: unsupported operand type(s) for /: 'DataFrame' and 'DataFrame'

The second one was: df_2 = (df.filter
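
Spark DataFrames do not support arithmetic between whole frames, so a common workaround is to compute the group sums, join them back, and divide column-wise. A sketch, where "age" is taken from the question and the "value" column name is assumed:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("divide-by-group-sum").getOrCreate()

# Hypothetical data: "age" is the grouping key, "value" the column to normalise.
df = spark.createDataFrame([(10, 2.0), (10, 6.0), (20, 5.0)], ["age", "value"])

sums = df.groupBy("age").agg(F.sum("value").alias("group_sum"))

result = (df.join(sums, on="age")
            .withColumn("share", F.col("value") / F.col("group_sum"))
            .drop("group_sum"))

result.show()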