apache-spark-dataset

Spark DataSet filter performance

Submitted by 我是研究僧i on 2019-12-03 08:22:14
I have been experimenting with different ways to filter a typed Dataset, and the performance turns out to be quite different. The Dataset was created from 1.6 GB of CSV data with 33 columns and 4,226,047 rows, loaded and mapped to a case class: val df = spark.read.csv(csvFile).as[FireIncident]. A filter on UnitId = 'B02' should return 47,980 rows. I tested three approaches:
1) Typed column (~500 ms on localhost): df.where($"UnitID" === "B02").count()
2) Temp table and SQL query (about the same as option 1): df.createOrReplaceTempView("FireIncidentsSF"); spark…
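The excerpt cuts off before the remaining variant, but the pattern being compared is easy to sketch. Below is a minimal, self-contained Scala sketch of a column-expression filter, the equivalent SQL query, and a typed lambda filter added for comparison (not necessarily the asker's third option). FireIncident's fields, the header option, and the CSV path are assumptions, not the asker's actual 33-column schema; the lambda variant is usually the slow one because each row must be deserialized into a FireIncident before the predicate runs.

```scala
import org.apache.spark.sql.SparkSession

// Assumed stand-in for the asker's case class and CSV layout.
case class FireIncident(IncidentNumber: String, UnitID: String)

object FilterComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("filter-comparison").master("local[*]").getOrCreate()
    import spark.implicits._

    val ds = spark.read.option("header", "true").csv("fire_incidents.csv").as[FireIncident]

    // 1) Column expression: the predicate stays inside Catalyst and can be optimized/pushed down.
    val c1 = ds.where($"UnitID" === "B02").count()

    // 2) SQL over a temp view: compiles to essentially the same plan as (1).
    ds.createOrReplaceTempView("FireIncidentsSF")
    val c2 = spark.sql("SELECT count(*) FROM FireIncidentsSF WHERE UnitID = 'B02'").first().getLong(0)

    // 3) Typed lambda filter: every row is deserialized into a FireIncident
    //    before the predicate runs, which is typically the slowest option.
    val c3 = ds.filter(_.UnitID == "B02").count()

    println(s"counts: $c1 / $c2 / $c3")
    spark.stop()
  }
}
```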

Spark Dataset select with TypedColumn

Submitted by ℡╲_俬逩灬. on 2019-12-03 07:14:56
Looking at the select() function on the Spark Dataset, there are various generated function signatures: (c1: TypedColumn[MyClass, U1], c2: TypedColumn[MyClass, U2], ...). This seems to hint that I should be able to reference the members of MyClass directly and be type safe, but I am not sure how. ds.select("member") of course works; it seems like ds.select(_.member) might also work somehow? In the Scala DSL for select, there are many ways to identify a Column:
From a symbol: 'name
From a string: $"name" or col(name)
From an expression: expr("nvl(name, 'unknown') as renamed")
To get a…
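The piece that makes select type safe is Column.as[U], which turns a plain Column into a TypedColumn so that select returns a Dataset of that type rather than a DataFrame. A minimal Scala sketch (MyClass and its fields are placeholders for the asker's class):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Placeholder class for the sketch.
case class MyClass(member: String, count: Long)

object TypedSelectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("typed-select").master("local[*]").getOrCreate()
    import spark.implicits._

    val ds: Dataset[MyClass] = Seq(MyClass("a", 1L), MyClass("b", 2L)).toDS()

    // $"member".as[String] turns the untyped Column into a TypedColumn,
    // so select returns Dataset[String] instead of a DataFrame.
    val members: Dataset[String] = ds.select($"member".as[String])

    // Selecting several typed columns yields a Dataset of tuples.
    val pairs: Dataset[(String, Long)] = ds.select($"member".as[String], $"count".as[Long])

    pairs.show()
    spark.stop()
  }
}
```

There is no ds.select(_.member) form; for lambda-style access you would use ds.map(_.member), at the cost of deserializing each row through the encoder.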

How to convert DataFrame to Dataset in Apache Spark in Java?

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-02 23:08:13
I can convert a DataFrame to a Dataset in Scala very easily: case class Person(name: String, age: Long); val df = ctx.read.json("/tmp/persons.json"); val ds = df.as[Person]; ds.printSchema. But in the Java version I don't know how to convert a DataFrame to a Dataset. Any idea? My attempt is: DataFrame df = ctx.read().json(logFile); Encoder<Person> encoder = new Encoder<>(); Dataset<Person> ds = new Dataset<Person>(ctx, df.logicalPlan(), encoder); ds.printSchema(); but the compiler says: Error:(23, 27) java: org.apache.spark.sql.Encoder is abstract; cannot be instantiated. Edited (solution): solution based on @Leet…
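The compile error is the clue: Encoder is an abstract interface that is never instantiated directly; you obtain one from the Encoders factory and pass it to Dataset.as. A minimal sketch of that pattern, written in Scala for consistency with these notes; Person and the JSON path mirror the question, and the same Encoders factory and as(encoder) calls are what you would invoke from Java:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Person mirrors the question's case class; from Java the equivalent would be
// a Person POJO plus Encoders.bean(Person.class).
case class Person(name: String, age: Long)

object DataFrameToDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("df-to-ds").master("local[*]").getOrCreate()

    val df = spark.read.json("/tmp/persons.json")

    // Never `new Encoder(...)`: encoders come from the Encoders factory
    // (Encoders.product for case classes, Encoders.bean for Java beans).
    val personEncoder: Encoder[Person] = Encoders.product[Person]

    // as(encoder) converts the untyped DataFrame into a typed Dataset.
    val ds = df.as[Person](personEncoder)
    ds.printSchema()

    spark.stop()
  }
}
```

From Java the equivalent calls are Encoders.bean(Person.class) and df.as(encoder), applied to a Person POJO with getters, setters, and a no-argument constructor.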

How to create a custom Encoder in Spark 2.X Datasets?

Submitted by 痞子三分冷 on 2019-12-02 22:38:47
Spark Datasets move away from Rows to Encoders for POJOs/primitives. The Catalyst engine uses an ExpressionEncoder to convert columns in a SQL expression. However, there do not appear to be other subclasses of Encoder available to use as a template for our own implementations. Here is an example of code that is happy with Spark 1.X / DataFrames but does not compile in the new regime: //mapping each row to an RDD tuple df.map(row => { var id: String = if (!has_id) "" else row.getAs[String]("id"); var label: String = row.getAs[String]("label"); val channels: Int = if (!has_channels) 0 else row…
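For arbitrary classes there is no public API for hand-writing an Encoder subclass; the practical options are the generic binary encoders (Encoders.kryo / Encoders.javaSerialization) or mapping into case classes and tuples, which get ExpressionEncoders automatically. A small Scala sketch of both options, with made-up record types standing in for the asker's data:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// LegacyRecord stands in for an arbitrary class with no built-in encoder;
// Record is its structured equivalent. Both are illustrative assumptions.
class LegacyRecord(val id: String, val label: String, val channels: Int) extends Serializable
case class Record(id: String, label: String, channels: Int)

object CustomEncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("custom-encoder").master("local[*]").getOrCreate()
    import spark.implicits._

    // Option 1: a generic binary encoder for the opaque class. Each value is
    // stored as one serialized blob, so its fields are not separate columns.
    implicit val legacyEncoder: Encoder[LegacyRecord] = Encoders.kryo[LegacyRecord]
    val opaque = spark.createDataset(Seq(new LegacyRecord("1", "cat", 3)))

    // Option 2: map into a case class, which gets a full ExpressionEncoder for
    // free and keeps a real columnar schema for Catalyst to work with.
    val structured = opaque.map(r => Record(r.id, r.label, r.channels))
    structured.printSchema()
    structured.show()

    spark.stop()
  }
}
```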

How to create a Spark Dataset from an RDD

Submitted by 被刻印的时光 ゝ on 2019-12-02 22:05:45
I have an RDD[LabeledPoint] intended to be used within a machine learning pipeline. How do we convert that RDD to a Dataset? Note that the newer spark.ml APIs require inputs in the Dataset format. javadba: Here is an answer that traverses an extra step - the DataFrame. We use the SQLContext to create a DataFrame and then create a Dataset of the desired object type - in this case a LabeledPoint: val sqlContext = new SQLContext(sc); val pointsTrainDf = sqlContext.createDataFrame(training); val pointsTrainDs = pointsTrainDf.as[LabeledPoint]. Update: Ever heard of a SparkSession? (Neither had I until…
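With a SparkSession (the point of the answer's update), the DataFrame detour becomes optional: createDataset, or rdd.toDS() via spark.implicits, builds the Dataset[LabeledPoint] directly. A minimal sketch, assuming the spark.ml LabeledPoint and a tiny in-memory RDD in place of the real training data:

```scala
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object RddToDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("rdd-to-dataset").master("local[*]").getOrCreate()
    import spark.implicits._

    // Tiny in-memory stand-in for the asker's RDD[LabeledPoint].
    val training = spark.sparkContext.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.1, 0.2)),
      LabeledPoint(0.0, Vectors.dense(0.5, 0.6))
    ))

    // SparkSession makes the SQLContext step unnecessary: createDataset (or
    // training.toDS()) uses the implicit product encoder for LabeledPoint.
    val pointsTrainDs = spark.createDataset(training)

    // Equivalent to the DataFrame detour quoted in the answer.
    val viaDf = spark.createDataFrame(training).as[LabeledPoint]

    pointsTrainDs.printSchema()
    viaDf.show()
    spark.stop()
  }
}
```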

Encoder for Row Type Spark Datasets

Submitted by 牧云@^-^@ on 2019-12-02 16:18:44
I would like to write an encoder for a Row type in a Dataset, for a map operation that I am doing. Essentially, I do not understand how to write encoders. Below is an example of a map operation; instead of returning Dataset<String>, I would like to return Dataset<Row>: Dataset<String> output = dataset1.flatMap(new FlatMapFunction<Row, String>() { @Override public Iterator<String> call(Row row) throws Exception { ArrayList<String> obj = //some map operation return obj.iterator(); } }, Encoders.STRING()); I understand that instead of a String encoder, an Encoder needs to be written as…
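For Dataset<Row> the encoder is not hand-written either; it is derived from a schema with RowEncoder (Spark 2.x API, which this question targets). Below is a Scala sketch of the same flatMap shape: the schema, input strings, and splitting logic are placeholders, and from Java you would pass RowEncoder.apply(schema) as the second flatMap argument exactly where Encoders.STRING() sits in the snippet above.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RowEncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("row-encoder").master("local[*]").getOrCreate()
    import spark.implicits._

    // Placeholder input; the real dataset1 would be whatever rows are mapped over.
    val dataset1 = Seq("a:1", "b:2").toDS()

    // Schema of the rows the flatMap produces; adjust to the real output.
    val outSchema = StructType(Seq(
      StructField("key", StringType, nullable = false),
      StructField("value", IntegerType, nullable = false)
    ))

    // RowEncoder(schema) plays the role Encoders.STRING() played for Dataset[String].
    val output = dataset1.flatMap { s =>
      val Array(k, v) = s.split(":")
      Seq(Row(k, v.toInt))
    }(RowEncoder(outSchema))

    output.printSchema()
    output.show()
    spark.stop()
  }
}
```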

Spark CSV - No applicable constructor/method found for actual parameters

Submitted by 假装没事ソ on 2019-12-02 04:11:44
I have an issue using lambda functions in filters and maps of typed Datasets in Java Spark applications. I am getting this runtime error: ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 130, Column 126: No applicable constructor/method found for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates are: "public static java.sql.Date org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)". I am using the class below with Spark 2.2.0. A full example with sample data is available at https://gitlab.com…
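Without the linked project it is hard to be definitive, but this error family usually means the generated code received a string column (hence UTF8String) where the mapped class declares a java.sql.Date, so no conversion exists. A hedged Scala sketch of the usual remedy, with a hypothetical incidentDate column: declare an explicit schema (or a dateFormat option) so the CSV column really is a DateType before .as[...] and the typed lambdas run.

```scala
import java.sql.Date
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DateType, StringType, StructField, StructType}

// Hypothetical record; the field names and file are assumptions, not the
// asker's actual class from the gitlab link.
case class Incident(id: String, incidentDate: Date)

object CsvDateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("csv-date").master("local[*]").getOrCreate()
    import spark.implicits._

    // Explicit schema: the date column arrives as DateType, so mapping it onto
    // a java.sql.Date field needs no string-to-date constructor.
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = true),
      StructField("incidentDate", DateType, nullable = true)
    ))

    val ds = spark.read
      .option("header", "true")
      .option("dateFormat", "yyyy-MM-dd") // how the dates are written in the file
      .schema(schema)
      .csv("incidents.csv")
      .as[Incident]

    ds.filter(_.incidentDate != null).show()
    spark.stop()
  }
}
```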

Load CSV data into a DataFrame and convert to an Array using Apache Spark (Java)

Submitted by 谁都会走 on 2019-12-02 03:00:12
Question: I have a CSV file with the data below:
1,2,5
2,4
2,3
I want to load it into a DataFrame whose column has an array-of-strings schema. The output should look like the following:
[1, 2, 5]
[2, 4]
[2, 3]
This has been answered using Scala here: Spark: Convert column of string to an array. I want to make it happen in Java, please help.
Answer 1: Below is the sample code in Java. You need to read your file using the spark.read().text(String path) method and then call the split function. import static org.apache.spark.sql.functions…
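The answer is truncated at the import, so here is the same approach sketched in Scala (the file name is a placeholder): read each line as a single text column, then split it on commas; the identical functions.split call is what the Java static import brings in.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

object CsvLineToArray {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("csv-to-array").master("local[*]").getOrCreate()

    // Read each line as a single string column named "value" ...
    val lines = spark.read.text("numbers.csv")

    // ... then split on commas, giving one array<string> column per line.
    val arrays = lines.select(split(col("value"), ",").as("numbers"))

    arrays.printSchema()            // numbers: array<string>
    arrays.show(truncate = false)   // [1, 2, 5], [2, 4], [2, 3]
    spark.stop()
  }
}
```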

Converting DataSet to Json Array Spark using Scala

Submitted by 空扰寡人 on 2019-12-02 02:53:03
I am new to Spark and unable to figure out the solution to the following problem. I have a JSON file to parse, then I create a couple of metrics and write the data back out in JSON format. The following is the code I am using: import org.apache.spark.sql._; import org.apache.log4j.{Level, Logger}; import org.apache.spark.sql.functions._; object quick2 { def main(args: Array[String]): Unit = { Logger.getLogger("org").setLevel(Level.ERROR); val spark = SparkSession.builder.appName("quick1").master("local[*]").getOrCreate(); val rawData = spark.read.json("/home/umesh/Documents/Demo2/src/main…
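The code is cut off before the metrics and the write, but the usual building blocks for "write it back as JSON" are Dataset.toJSON (one JSON object per row) and DataFrameWriter.json, plus collect and mkString if a single bracketed JSON array string is genuinely required. A minimal sketch with made-up metric columns standing in for the asker's results:

```scala
import org.apache.spark.sql.SparkSession

object DatasetToJsonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ds-to-json").master("local[*]").getOrCreate()
    import spark.implicits._

    // Stand-in for the metrics computed from the JSON input.
    val metrics = Seq(("deviceA", 42L), ("deviceB", 7L)).toDF("device", "count")

    // One JSON object per row, e.g. {"device":"deviceA","count":42}
    val asJsonLines = metrics.toJSON

    // Written out as JSON Lines (one object per line), Spark's usual format.
    metrics.write.mode("overwrite").json("/tmp/metrics-json")

    // If a single JSON array string is required (small results only!):
    val jsonArray = "[" + asJsonLines.collect().mkString(",") + "]"
    println(jsonArray)

    spark.stop()
  }
}
```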