apache-spark

Handling empty arrays in pySpark (optional binary element (UTF8) is not a group)

Submitted by 寵の児 on 2021-01-27 21:02:50
Question: I have a JSON-like structure in Spark which looks as follows: >>> df = spark.read.parquet(good_partition_path) id: string some-array: array element: struct array-field-1: string array-field-2: string Depending on the partition, some-array might be an empty array for all ids. When this happens Spark infers the following schema: >>> df = spark.read.parquet(bad_partition_path) id: string some-array: array element: string Of course that's a problem if I want to read multiple partitions because
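A common workaround (a sketch, not the excerpt's own answer) is to skip schema inference entirely and pass an explicit schema when reading, so an all-empty partition can no longer be inferred as array<string>. The field names below are taken from the schema shown above; the paths are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Declare the element struct explicitly so empty partitions keep the
# same array<struct<...>> type as the populated ones.
schema = StructType([
    StructField("id", StringType()),
    StructField("some-array", ArrayType(StructType([
        StructField("array-field-1", StringType()),
        StructField("array-field-2", StringType()),
    ]))),
])

good_partition_path = "/data/table/part=good"   # placeholder paths
bad_partition_path = "/data/table/part=bad"

# Both partitions now load with a compatible schema.
df = spark.read.schema(schema).parquet(good_partition_path, bad_partition_path)
```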

Spark read CSV - Not showing corrupt Records

Submitted by 佐手、 on 2021-01-27 20:54:30
Question: Spark has a PERMISSIVE mode for reading CSV files which stores the corrupt records in a separate column named _corrupt_record . permissive - Sets all fields to null when it encounters a corrupted record and places all corrupted records in a string column called _corrupt_record However, when I try the following example, I don't see any column named _corrupt_record . The records which don't match the schema appear as null. data.csv data 10.00 11.00 $12.00 $13 gaurang code import
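One frequent cause (a hedged sketch of the usual fix, reusing the data.csv from the question): when a schema is supplied explicitly, the _corrupt_record column only appears if it is part of that schema, so it has to be added by hand.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# PERMISSIVE mode needs a string column in the schema to hold bad rows.
schema = StructType([
    StructField("data", DoubleType()),
    StructField("_corrupt_record", StringType()),
])

df = (
    spark.read
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .schema(schema)
    .csv("data.csv")
)

# Rows like "$12.00" fail the double parse: data is null and the raw
# line lands in _corrupt_record.
df.show(truncate=False)
```

Depending on the Spark version, selecting only the _corrupt_record column can raise an AnalysisException unless the DataFrame is cached first (df.cache()).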

Spark read Parquet files of different versions

Submitted by 点点圈 on 2021-01-27 20:32:29
Question: I have Parquet files generated for over a year with a Version1 schema. With a recent schema change, the newer Parquet files have extra Version2 columns. So when I load Parquet files from the old version and the new version together and try to filter on the changed columns, I get an exception. I would like Spark to read old and new files and fill in null values where the column is not present. Is there a workaround for this where Spark fills in null values when the column is not found?
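A minimal sketch of the usual approach (the paths and the filtered column name are placeholders): the Parquet reader can union the two schemas with the mergeSchema option, returning null for columns that are missing from the older files.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# mergeSchema reconciles the Version1 and Version2 column sets; rows
# from old files get null for the columns they do not have.
df = (
    spark.read
    .option("mergeSchema", "true")
    .parquet("/data/events/year=2019/", "/data/events/year=2020/")
)

# Filtering on a Version2-only column now works; old rows come back as null.
df.filter(df["new_column"].isNotNull()).show()
```

Setting spark.sql.parquet.mergeSchema=true in the session config has the same effect globally.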

Spark Dataset cache is using only one executor

Submitted by 给你一囗甜甜゛ on 2021-01-27 20:32:28
Question: I have a process which reads a Hive (parquet-snappy) table and builds a dataset of 2GB. It is an iterative (~7K iterations) process, and this dataset is going to be the same for all iterations, so I decided to cache the dataset. Somehow the cache task is done on one executor only, and it seems like the cache lives on that one executor only, which leads to delays, OOM, etc. Is it because of Parquet? How do I make sure that the cache is distributed across multiple executors? Here is the spark config: Executors: 3 Cores: 4 Memory: 4GB
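A hedged sketch of one common fix (the table name and partition count are placeholders): if the read produces only one or a few partitions, every cached block lands on the executor holding them, so repartitioning before cache() spreads the blocks across the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ds = spark.table("my_hive_table")        # hypothetical source table

# Spread the 2GB across partitions first, e.g. 3 executors * 4 cores * 2.
ds = ds.repartition(24)
ds.cache()
ds.count()                               # materialize the cache once

print(ds.rdd.getNumPartitions())         # verify; also check the Spark UI Storage tab
```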

How to parse a YAML with spark/scala

Submitted by 坚强是说给别人听的谎言 on 2021-01-27 20:02:01
Question: I have a YAML file with the following details. File name: config.yml - firstName: "James" lastName: "Bond" age: 30 - firstName: "Super" lastName: "Man" age: 25 From this I need to get a Spark dataframe using Spark with Scala: +---+---------+--------+ |age|firstName|lastName| +---+---------+--------+ |30 |James |Bond | |25 |Super |Man | +---+---------+--------+ I have tried converting to JSON and then to a dataframe, but I am not able to specify it in a dataset sequence. Answer 1: There is a solution, that
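The question asks for Scala, where a YAML library such as SnakeYAML would do the parsing before toDF; to keep the code samples on this page in one language, here is a PySpark-flavoured sketch of the same idea, assuming PyYAML is available on the driver: parse the YAML into plain records, then hand them to createDataFrame.

```python
import yaml  # PyYAML, assumed installed on the driver
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Parse config.yml on the driver into a list of dicts.
with open("config.yml") as f:
    records = yaml.safe_load(f)  # [{'firstName': 'James', 'lastName': 'Bond', 'age': 30}, ...]

# Wrap each dict in a Row so Spark can infer the schema.
df = spark.createDataFrame([Row(**r) for r in records])
df.show()
# +---+---------+--------+
# |age|firstName|lastName|
# +---+---------+--------+
# | 30|    James|    Bond|
# | 25|    Super|     Man|
# +---+---------+--------+
```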

What are the pros and cons of java serialization vs kryo serialization?

Submitted by 冷暖自知 on 2021-01-27 19:12:26
Question: In Spark, Java serialization is the default. If Kryo is that efficient, why is it not set as the default? Are there any cons to using Kryo, or in what scenarios should we use Kryo vs. Java serialization? Answer 1: Here is a comment from the documentation: Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you'll use in the program in advance for best performance. So it is not used
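For completeness, a minimal sketch of opting in to Kryo (the configuration keys are standard Spark settings; the registered class name is a placeholder). In PySpark this mainly speeds up JVM-side data such as shuffles and cached RDDs, since Python objects are still pickled.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Kryo must be enabled explicitly; it is faster and more compact than
# Java serialization but does not cover every Serializable type and
# performs best when classes are registered up front.
conf = (
    SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Placeholder class name; register the JVM classes you actually ship.
    .set("spark.kryo.classesToRegister", "com.example.MyCaseClass")
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
```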

Implicit schema for pandas_udf in PySpark?

Submitted by 南楼画角 on 2021-01-27 18:01:32
Question: This answer nicely explains how to use PySpark's groupby and pandas_udf to do custom aggregations. However, I cannot possibly declare my schema manually as shown in this part of the example: from pyspark.sql.types import * schema = StructType([ StructField("key", StringType()), StructField("avg_min", DoubleType()) ]) since I will be returning 100+ columns with names that are automatically generated. Is there any way to tell PySpark to just implicitly use the schema returned by my function and
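One workaround people reach for (a sketch under assumptions, not necessarily the accepted answer): run the pandas function once on a small sample and let spark.createDataFrame infer the StructType, instead of typing out 100+ StructFields. The toy data and the my_agg function below are hypothetical.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy input; the real DataFrame would have 100+ value columns.
df = spark.createDataFrame(
    [("a", 1.0, 2.0), ("a", 3.0, 4.0), ("b", 5.0, 6.0)],
    ["key", "min", "max"],
)

def my_agg(pdf: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical aggregation whose output column names are auto-generated.
    return pdf.groupby("key").mean().add_prefix("avg_").reset_index()

# Apply the function to a small pandas sample and let Spark infer the schema.
sample_pdf = df.limit(100).toPandas()
inferred_schema = spark.createDataFrame(my_agg(sample_pdf)).schema

# Spark 3.x shown here; on 2.x the same schema can be passed to
# pandas_udf(my_agg, inferred_schema, PandasUDFType.GROUPED_MAP).
result = df.groupby("key").applyInPandas(my_agg, schema=inferred_schema)
result.show()
```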

What does checkpointing do on Apache Spark?

Submitted by 心不动则不痛 on 2021-01-27 17:50:17
Question: What does checkpointing do for Apache Spark, and does it take any hits on RAM or CPU? Answer 1: From the Apache Spark Streaming documentation - hope it helps: A streaming application must operate 24/7 and hence must be resilient to failures unrelated to the application logic (e.g., system failures, JVM crashes, etc.). For this to be possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system such that it can recover from failures. There are two types of data that are
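For the batch APIs, a minimal sketch of what enabling checkpointing looks like (the directory is a placeholder): the data is written to fault-tolerant storage and the lineage is truncated, so the cost is mostly an extra job plus disk/network I/O rather than a standing hit on RAM or CPU.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Checkpoint files go to fault-tolerant storage such as HDFS.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   # placeholder path

# RDD API: checkpoint() only marks the RDD; it is written out on the
# next action, after which the lineage is dropped.
rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)
rdd.persist()      # recommended, otherwise the RDD is recomputed to be saved
rdd.checkpoint()
rdd.count()        # triggers the job and materializes the checkpoint

# DataFrame API: returns a new, already-checkpointed DataFrame (eager by default).
df = spark.range(1000).checkpoint()
```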

spark test on local machine

Submitted by 社会主义新天地 on 2021-01-27 17:15:11
Question: I am running unit tests on Spark 1.3.1 with sbt test, and besides the unit tests being incredibly slow, I keep running into java.lang.ClassNotFoundException: org.apache.spark.storage.RDDBlockId issues. Usually this means a dependency issue, but I wouldn't know from where. I tried installing everything on a new machine, including a fresh Hadoop and a fresh ivy2, but I still run into the same issue. Any help is greatly appreciated. Exception: Exception in thread "Driver Heartbeater" java.lang