apache-spark-dataset

How to transform a csv string into a Spark-ML compatible Dataset<Row> format?

坚强是说给别人听的谎言 submitted on 2019-12-24 09:52:49
Question: I have a Dataset<Row> df that contains two columns ("key" and "value") of type string. df.printSchema(); gives me the following output: root |-- key: string (nullable = true) |-- value: string (nullable = true) The content of the value column is actually a CSV-formatted line (coming from a Kafka topic), with the last entry of that line representing the class label and all the previous entries being the features (first row not included in the dataset): feature0,feature1,label 0
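
One way to sketch this (assuming Spark 2.x, a SparkSession named spark, and that every field in value parses as a double with the label last): map each line to a Row holding an ML Vector plus the label, then rebuild a Dataset<Row> with an explicit schema. Names such as mlReady are placeholders.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.ml.linalg.SQLDataTypes;
    import org.apache.spark.ml.linalg.Vectors;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    // Split each csv line: all entries but the last become the feature vector,
    // the last entry becomes the label.
    JavaRDD<Row> rows = df.javaRDD().map(r -> {
        String[] parts = r.<String>getAs("value").split(",");
        double[] feats = new double[parts.length - 1];
        for (int i = 0; i < feats.length; i++) {
            feats[i] = Double.parseDouble(parts[i]);
        }
        return RowFactory.create(Vectors.dense(feats), Double.parseDouble(parts[parts.length - 1]));
    });

    // Schema with the vector type that spark.ml classifiers expect.
    StructType schema = new StructType()
        .add("features", SQLDataTypes.VectorType(), false)
        .add("label", DataTypes.DoubleType, false);

    Dataset<Row> mlReady = spark.createDataFrame(rows, schema);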

How to use different window specification per column values?

吃可爱长大的小学妹 submitted on 2019-12-24 09:48:14
Question: This is my partitionBy condition, which I need to change based on a column value from the DataFrame. val windowSpec = Window.partitionBy("col1", "clo2","clo3").orderBy($"Col5".desc) Now if the value of one of the columns (col6) in the DataFrame is I, then the above condition applies. But when the value of the column (col6) changes to O, then the condition below applies: val windowSpec = Window.partitionBy("col1","clo3").orderBy($"Col5".desc) How can I implement this in the Spark DataFrame? So it is like for each record
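
A sketch of one way to avoid branching the whole job (assuming a row-wise ranking such as row_number is the goal; column names are taken from the question as written): evaluate both window specifications and pick per row with when/otherwise. In Java:

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.expressions.Window;
    import org.apache.spark.sql.expressions.WindowSpec;

    // Full partitioning for rows where col6 = "I", reduced partitioning for the rest.
    WindowSpec fullSpec    = Window.partitionBy("col1", "clo2", "clo3").orderBy(col("Col5").desc());
    WindowSpec reducedSpec = Window.partitionBy("col1", "clo3").orderBy(col("Col5").desc());

    Dataset<Row> ranked = df.withColumn("rn",
        when(col("col6").equalTo("I"), row_number().over(fullSpec))
            .otherwise(row_number().over(reducedSpec)));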

How to pass Encoder as parameter to dataframe's as method

只谈情不闲聊 submitted on 2019-12-24 09:48:12
Question: I want to convert a DataFrame to a Dataset using different case classes. Currently, my code is like below. case class Views(views: Double) case class Clicks(clicks: Double) def convertViewsDFtoDS(df: DataFrame){ df.as[Views] } def convertClicksDFtoDS(df: DataFrame){ df.as[Clicks] } So, my question is: "Is there any way I can use one general function for this by passing the case class as an extra parameter to the function?" Answer 1: It seems a bit obsolete (the as method does exactly what you want) but you can import org
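
In the Java API the as method already takes the Encoder explicitly, which makes a single generic converter straightforward; a minimal sketch, assuming Views and Clicks are Java beans (in Scala the analogous trick is a context-bound type parameter, i.e. def convertToDS[T: Encoder](df: DataFrame) = df.as[T]):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;

    // One generic converter: the concrete target class is supplied at the call site.
    public static <T> Dataset<T> convertToDS(Dataset<Row> df, Class<T> beanClass) {
        return df.as(Encoders.bean(beanClass));
    }

    // Usage:
    // Dataset<Views>  viewsDs  = convertToDS(viewsDf,  Views.class);
    // Dataset<Clicks> clicksDs = convertToDS(clicksDf, Clicks.class);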

Unexpected encoder behaviour when applying a flatMap operation on a Apache Spark Dataset<Row>

吃可爱长大的小学妹 submitted on 2019-12-24 02:15:30
Question: I'm trying to convert a CSV string that actually contains double values into a Spark ML-compatible dataset. Since I don't know the number of features to expect beforehand, I decided to use a helper class "Instance" that already contains the right datatypes to be used by the classifiers, and that is working as intended in some other cases already: public class Instance implements Serializable { /** * */ private static final long serialVersionUID = 6091606543088855593L; private Vector
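
A hedged sketch of the usual fix when a flatMap result collapses into a single binary column: hand flatMap an explicit bean encoder for Instance instead of relying on a Kryo or implicit encoder. parseCsvLine is a hypothetical helper that turns one csv line into a List<Instance>.

    import org.apache.spark.api.java.function.FlatMapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoder;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;

    Encoder<Instance> instanceEncoder = Encoders.bean(Instance.class);

    // With the bean encoder the result keeps Instance's columns instead of a
    // serialized "value: binary" column.
    Dataset<Instance> instances = df.flatMap(
        (FlatMapFunction<Row, Instance>) row ->
            parseCsvLine(row.<String>getAs("value")).iterator(),   // parseCsvLine: hypothetical
        instanceEncoder);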

Split dataset based on column values in spark

…衆ロ難τιáo~ submitted on 2019-12-23 07:36:41
Question: I am trying to split the Dataset into different Datasets based on the contents of the Manufacturer column. It is very slow. Please suggest a way to improve the code so that it can execute faster and reduce the amount of Java code. List<Row> lsts = countsByAge.collectAsList(); for (Row lst : lsts) { String man = lst.toString(); man = man.replaceAll("[\\p{Ps}\\p{Pe}]", ""); Dataset<Row> DF = src.filter("Manufacturer='" + man + "'"); DF.show(); } The code, input and output Datasets are shown below. package org
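
Two hedged alternatives to the collect-and-loop pattern (the output path and the need for in-memory splits are assumptions): write everything in a single pass partitioned by Manufacturer, or filter once per distinct value without round-tripping through Row.toString.

    import static org.apache.spark.sql.functions.col;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Option 1: one pass, one output directory per manufacturer value.
    src.write().partitionBy("Manufacturer").mode("overwrite").parquet("/tmp/by_manufacturer");

    // Option 2: separate in-memory Datasets, filtering on the raw column value.
    Map<String, Dataset<Row>> splits = new HashMap<>();
    for (Row r : src.select("Manufacturer").distinct().collectAsList()) {
        String m = r.getString(0);
        splits.put(m, src.filter(col("Manufacturer").equalTo(m)));
    }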

Spark doing exchange of partitions already correctly distributed

扶醉桌前 submitted on 2019-12-23 06:49:33
Question: I am joining 2 datasets by two columns, and the result is a dataset containing 55 billion rows. After that I have to do some aggregation on this dataset by a different column than the ones used in the join. The problem is that Spark does a partition exchange after the join (taking too much time with 55 billion rows) although the data is already correctly distributed, because the aggregate column is unique. I know that the aggregation key is correctly distributed; is there a way to tell this to the Spark app? Answer 1: 1) Go to Spark UI
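
There is no API that directly declares "this Dataset is already distributed by X", so this is only a hedged sketch of two things that often help: read the physical plan to see where the Exchange comes from, and persist the join result bucketed by the aggregation key so later reads can group on it without another shuffle. joined, aggKey and the table name are placeholders.

    // Locate the extra shuffle: look for Exchange nodes in the physical plan.
    joined.groupBy("aggKey").count().explain(true);

    // Persist bucketed (and sorted) by the aggregation key; subsequent reads of the
    // bucketed table can aggregate on that key without another exchange.
    joined.write()
        .bucketBy(200, "aggKey")
        .sortBy("aggKey")
        .saveAsTable("joined_bucketed");

    spark.table("joined_bucketed").groupBy("aggKey").count().explain(true);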

Hive partitions, Spark partitions and joins in Spark - how they relate

℡╲_俬逩灬. submitted on 2019-12-22 08:57:56
Question: Trying to understand how Hive partitions relate to Spark partitions, culminating in a question about joins. I have 2 external Hive tables, both backed by S3 buckets and partitioned by date; so in each bucket there are keys with the name format date=<yyyy-MM-dd>/<filename>. Question 1: If I read this data into Spark: val table1 = spark.table("table1").as[Table1Row] val table2 = spark.table("table2").as[Table2Row] then how many partitions are the resultant datasets going to have respectively?
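
For Question 1, a quick empirical check (a sketch in Java): for file-based sources the partition count is driven by file sizes and the file-source settings, not by the number of Hive date= partitions, so it can simply be inspected.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Dataset<Row> table1 = spark.table("table1");

    // Number of input splits Spark actually created for this read.
    System.out.println(table1.javaRDD().getNumPartitions());

    // The settings that drive that number for file-based sources.
    System.out.println(spark.conf().get("spark.sql.files.maxPartitionBytes"));
    System.out.println(spark.conf().get("spark.sql.files.openCostInBytes"));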

How to find first non-null values in groups? (secondary sorting using dataset api)

 ̄綄美尐妖づ submitted on 2019-12-21 12:04:42
Question: I am working on a dataset which represents a stream of events (like tracking events fired from a website). All the events have a timestamp. One use case we often have is trying to find the first non-null value for a given field. So, for example, something like this gets us most of the way there: val eventsDf = spark.read.json(jsonEventsPath) case class ProjectedFields(visitId: String, userId: Int, timestamp: Long ... ) val projectedEventsDs = eventsDf.select( eventsDf("message.visit.id").alias(
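
For the "first non-null value" part, a hedged sketch (grouping by visitId and ordering by timestamp, as in the projection above; userId stands in for the field of interest, and projectedEvents is that projected Dataset treated as a Dataset<Row>): functions.first with ignoreNulls over an ordered window handles the secondary sort.

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.expressions.Window;
    import org.apache.spark.sql.expressions.WindowSpec;

    WindowSpec byVisit = Window.partitionBy("visitId")
        .orderBy(col("timestamp"))
        .rowsBetween(Window.unboundedPreceding(), Window.unboundedFollowing());

    // Earliest non-null userId within each visit, in timestamp order.
    Dataset<Row> withFirstUser = projectedEvents
        .withColumn("firstUserId", first(col("userId"), true).over(byVisit));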

How to convert DataFrame to Dataset in Apache Spark in Java?

放肆的年华 submitted on 2019-12-20 10:33:05
Question: I can convert a DataFrame to a Dataset in Scala very easily: case class Person(name:String, age:Long) val df = ctx.read.json("/tmp/persons.json") val ds = df.as[Person] ds.printSchema but in the Java version I don't know how to convert a DataFrame to a Dataset. Any idea? My attempt is: DataFrame df = ctx.read().json(logFile); Encoder<Person> encoder = new Encoder<>(); Dataset<Person> ds = new Dataset<Person>(ctx,df.logicalPlan(),encoder); ds.printSchema(); but the compiler says: Error:(23, 27) java: org
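
A minimal sketch of the usual Java route in Spark 2.x (where DataFrame is just Dataset<Row>), assuming spark is a SparkSession and Person is a Java bean with a no-arg constructor and getters/setters: build the encoder with Encoders.bean and pass it to as.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoder;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;

    Dataset<Row> df = spark.read().json(logFile);

    Encoder<Person> personEncoder = Encoders.bean(Person.class);
    Dataset<Person> ds = df.as(personEncoder);
    ds.printSchema();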

How to create a Spark Dataset from an RDD

混江龙づ霸主 submitted on 2019-12-20 10:16:45
Question: I have an RDD[LabeledPoint] intended to be used within a machine learning pipeline. How do we convert that RDD to a Dataset? Note that the newer spark.ml APIs require inputs in the Dataset format. Answer 1: Here is an answer that traverses an extra step: the DataFrame. We use the SQLContext to create a DataFrame and then create a Dataset using the desired object type - in this case a LabeledPoint: val sqlContext = new SQLContext(sc) val pointsTrainDf = sqlContext.createDataFrame(training) val
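
Sketched in Java under the assumption of a JavaRDD of org.apache.spark.ml.feature.LabeledPoint and a SparkSession named spark: build Rows plus an explicit schema (SQLDataTypes.VectorType() for the features column) and call createDataFrame, mirroring the Scala answer above.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.ml.linalg.SQLDataTypes;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    // labeledPoints: JavaRDD<LabeledPoint> wrapping the existing RDD (placeholder name).
    JavaRDD<Row> rows = labeledPoints.map(lp -> RowFactory.create(lp.label(), lp.features()));

    StructType schema = new StructType()
        .add("label", DataTypes.DoubleType, false)
        .add("features", SQLDataTypes.VectorType(), false);

    Dataset<Row> points = spark.createDataFrame(rows, schema);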