apache-spark

Spark Error: Unable to find encoder for type stored in a Dataset

Submitted by 霸气de小男生 on 2021-01-27 07:50:16
Question: I am using Spark in a Zeppelin notebook, and groupByKey() does not seem to be working. This code:

    df.groupByKey(row => row.getLong(0))
      .mapGroups((key, iterable) => println(key))

gives me this error (presumably a compilation error, since it appears immediately even though the dataset I am working on is quite large):

    error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for …
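
A minimal sketch of one likely fix, assuming a standard SparkSession named spark and that the first column of df is a Long: mapGroups must return a type for which Spark can find an Encoder, and println returns Unit, which has no encoder. Importing spark.implicits._ and returning an encodable value (a primitive, tuple, or case class) usually resolves the error.

    // Minimal sketch, not verified against the asker's environment.
    // Assumes `spark` is an active SparkSession and `df` is a DataFrame
    // whose first column is a Long.
    import spark.implicits._   // brings encoders for primitives, tuples, case classes

    val counts = df
      .groupByKey(row => row.getLong(0))          // key each row by its first column
      .mapGroups((key, rows) => (key, rows.size)) // return (Long, Int): an encodable tuple
    // mapGroups((key, rows) => println(key)) fails because the result type is Unit,
    // and there is no Encoder[Unit].
    counts.show()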

How to find the index of the maximum value in a vector column?

Submitted by 丶灬走出姿态 on 2021-01-27 07:45:53
Question: I have a Spark DataFrame with the following structure:

    root
     |-- distribution: vector (nullable = true)

    +-----------------+
    |topicDistribution|
    +-----------------+
    |       [0.1, 0.2]|
    |       [0.3, 0.2]|
    |       [0.5, 0.2]|
    |       [0.1, 0.7]|
    |       [0.1, 0.8]|
    |       [0.1, 0.9]|
    +-----------------+

My question is: how do I add a column with the index of the maximum value for each row? It should be something like this:

    root
     |-- distribution: vector (nullable = true)
     |-- max_index: integer (nullable = true)

    +---- …
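
A minimal sketch of one common approach, assuming the column holds org.apache.spark.ml.linalg.Vector values (as an LDA topicDistribution column typically does) and is named topicDistribution as in the output above: Vector exposes an argmax method, so a small UDF can map each vector to the index of its largest entry.

    // Minimal sketch; column name and vector type are assumptions based on the
    // schema shown in the question.
    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.functions.{col, udf}

    // Vector.argmax returns the index of the largest element.
    val maxIndex = udf { v: Vector => v.argmax }

    val withIndex = df.withColumn("max_index", maxIndex(col("topicDistribution")))
    withIndex.show()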

Spark efficiently filtering entries from big dataframe that exist in a small dataframe

Submitted by 自作多情 on 2021-01-27 07:44:02
Question: I have a Spark program that reads a relatively big dataframe (~3.2 terabytes) containing two columns, id and name, and another relatively small dataframe (~20k entries) containing a single column, id. What I'm trying to do is keep both the id and the name from the big dataframe when the id appears in the small dataframe. I was wondering what an efficient solution would be, and why. Several options I had in mind:
- Broadcast join the two dataframes
- Broadcast the small dataframe and …
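
A minimal sketch of the broadcast-join option, using hypothetical bigDF (id, name) and smallDF (id) DataFrames as stand-ins for the asker's inputs: a left-semi join against the broadcast small dataframe keeps exactly the big-side rows whose id appears in the small one, and avoids shuffling the 3.2 TB side.

    // Minimal sketch; bigDF and smallDF are placeholders for the asker's inputs.
    import org.apache.spark.sql.functions.broadcast

    // left_semi keeps only bigDF rows whose id has a match in smallDF,
    // and returns only bigDF's columns (id, name).
    val filtered = bigDF.join(broadcast(smallDF), Seq("id"), "left_semi")
    filtered.show()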

Spark 2.0.0: SparkR CSV Import

Submitted by 依然范特西╮ on 2021-01-27 06:48:37
Question: I am trying to read a CSV file into SparkR (running Spark 2.0.0) and to experiment with the newly added features. I am using RStudio. I get an error while "reading" the source file. My code:

    Sys.setenv(SPARK_HOME = "C:/spark-2.0.0-bin-hadoop2.6")
    library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
    sparkR.session(master = "local[*]", appName = "SparkR")
    df <- loadDF("F:/file.csv", "csv", header = "true")

I get an error at the loadDF function. The …
