apache-spark-dataset

Jaro-Winkler score calculation in Apache Spark

北战南征 submitted on 2019-12-11 06:08:13
Question: We need to implement a Jaro-Winkler distance calculation across strings in an Apache Spark Dataset. We are new to Spark and, after searching the web, we were not able to find much. It would be great if you could guide us. We thought of using flatMap, then realized it would not help; then we tried a couple of foreach loops but could not figure out how to go forward, since each string has to be compared against all the others, as in the dataset below. RowFactory.create(0, "Hi I heard about Spark"), RowFactory …
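For what it is worth, one common way to compare every string against every other string is a crossJoin plus a UDF that wraps an existing Jaro-Winkler implementation. The sketch below is only one possible approach, in Scala, and assumes the Apache Commons Text library is on the classpath and an invented column layout (id, sentence):

    import org.apache.commons.text.similarity.JaroWinklerSimilarity
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}

    val spark = SparkSession.builder().appName("jaro-winkler").getOrCreate()
    import spark.implicits._

    // Hypothetical input: an id and a single string column named "sentence".
    val ds = Seq(
      (0L, "Hi I heard about Spark"),
      (1L, "I wish Java could use case classes")
    ).toDF("id", "sentence")

    // UDF wrapping Apache Commons Text; the instance is created inside the
    // function to avoid closure-serialization issues.
    val jaroWinkler = udf { (a: String, b: String) =>
      new JaroWinklerSimilarity().apply(a, b).doubleValue()
    }

    // crossJoin pairs every row with every other row; the filter drops self-pairs.
    val right = ds.toDF("id2", "sentence2")
    val pairs = ds.crossJoin(right)
      .filter(col("id") =!= col("id2"))
      .withColumn("score", jaroWinkler(col("sentence"), col("sentence2")))

    pairs.show(truncate = false)

Note that a crossJoin is quadratic in the number of rows, so for large datasets some blocking or pre-filtering step is usually added before scoring.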

Created a nested schema in Apache Spark SQL

。_饼干妹妹 submitted on 2019-12-11 05:59:54
Question: I want to load a simple JSON document into my SparkSession: an employee with an array of addresses. The sample JSON is below: {"firstName":"Neil","lastName":"Irani", "addresses" : [ { "city" : "Brindavan", "state" : "NJ" }, { "city" : "Subala", "state" : "DT" }]} I'm trying to create the schema for loading this JSON, and I believe there is something wrong in the way I build it below; please advise. The code is in Java; I could not find a reasonable sample. List<StructField> …
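For reference, a schema with an array of address structs is usually built from nested StructType/ArrayType values. The question's code is in Java, but the API has the same shape there; below is a minimal Scala sketch using the field names from the sample JSON (the file path is a placeholder):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("nested-schema").getOrCreate()

    // addresses is an array of structs, each with a city and a state.
    val addressType = StructType(Seq(
      StructField("city", StringType, nullable = true),
      StructField("state", StringType, nullable = true)
    ))

    val employeeSchema = StructType(Seq(
      StructField("firstName", StringType, nullable = true),
      StructField("lastName", StringType, nullable = true),
      StructField("addresses", ArrayType(addressType, containsNull = true), nullable = true)
    ))

    // Placeholder path; each line of the file holds one JSON document like the sample.
    val employees = spark.read.schema(employeeSchema).json("/path/to/employees.json")
    employees.printSchema()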

How should I convert an RDD of org.apache.spark.ml.linalg.Vector to Dataset?

谁说我不能喝 submitted on 2019-12-11 00:54:12
Question: I'm struggling to understand how conversion among RDDs, Datasets, and DataFrames works. I'm pretty new to Spark, and I get stuck every time I need to pass from one data model to another (especially from RDDs to Datasets and DataFrames). Could anyone explain the right way to do it? As an example, I now have an RDD[org.apache.spark.ml.linalg.Vector] and I need to pass it to my machine learning algorithm, for example KMeans (Spark Dataset MLlib). So I need to convert it to a Dataset with a …
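One commonly suggested route is to wrap each Vector in a Tuple1 (or a case class) so Spark can derive a schema, then name the column "features" for the ML estimator. A sketch under those assumptions (the sample vectors are made up):

    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("rdd-to-dataset").getOrCreate()

    // Made-up RDD[Vector]; replace with the real one.
    val vectorRDD: RDD[Vector] = spark.sparkContext.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0)
    ))

    // Wrapping each vector in a Tuple1 gives Spark a schema it can reflect on
    // (ml.linalg.Vector is backed by a registered user-defined type);
    // ML estimators such as KMeans expect the column to be called "features".
    val features = spark.createDataFrame(vectorRDD.map(Tuple1.apply)).toDF("features")
    features.printSchema()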

Spark 2.0 DataSets groupByKey and divide operation and type safety

北慕城南 submitted on 2019-12-10 14:14:19
Question: I am very pleased with Spark 2.0 Datasets because of their compile-time type safety. But here are a couple of problems I am not able to work out, and I also didn't find good documentation for them. Problem #1 - divide operation on an aggregated column: consider the code below. I have a Dataset[MyCaseClass] and I want to groupByKey on c1, c2, c3 and compute sum(c4) / 8. The code below works well if I just calculate the sum, but it gives a compile-time error for divide(8). I wonder how I can achieve the following. …
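One way to keep the typed API is to aggregate first and perform the division in a subsequent map, where the value is a plain Double. A sketch that assumes a particular shape for MyCaseClass (the real field types may differ):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.scalalang.typed

    // Assumed shape of the case class from the question.
    case class MyCaseClass(c1: String, c2: String, c3: String, c4: Double)

    val spark = SparkSession.builder().appName("typed-agg").getOrCreate()
    import spark.implicits._

    val ds = Seq(
      MyCaseClass("a", "b", "c", 16.0),
      MyCaseClass("a", "b", "c", 8.0)
    ).toDS()

    // The typed sum keeps compile-time checking; the division happens after the
    // aggregation, where the result is just a (key, Double) pair.
    val result = ds
      .groupByKey(r => (r.c1, r.c2, r.c3))
      .agg(typed.sum[MyCaseClass](_.c4))
      .map { case (key, total) => (key, total / 8) }

    result.show()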

S3 SlowDown error in Spark on EMR

谁都会走 submitted on 2019-12-09 02:18:17
Question: I am getting this error when writing a Parquet file; it has started happening recently.
    com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: 2CA496E2AB87DC16), S3 Extended Request ID: 1dBrcqVGJU9VgoT79NAVGyN0fsbj9+6bipC7op97ZmP+zSFIuH72lN03ZtYabNIA2KaSj18a8ho= at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.handleErrorResponse …
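Two levers that are often tried for SlowDown throttling are writing with fewer concurrent requests per key prefix and raising the EMRFS retry count. A hedged Scala sketch: the retry key is an EMRFS setting as I understand it, so verify it against the EMR documentation for your release, and the paths and partition count are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("s3-write").getOrCreate()

    // Assumption: fs.s3.maxRetries is the EMRFS retry setting; check emrfs-site for your release.
    spark.sparkContext.hadoopConfiguration.set("fs.s3.maxRetries", "20")

    val df = spark.read.parquet("s3://my-bucket/input/")   // placeholder input

    // Fewer output partitions means fewer simultaneous PUTs against the same key prefix,
    // which is what the 503 SlowDown response is throttling.
    df.repartition(64)
      .write
      .mode("overwrite")
      .parquet("s3://my-bucket/output/")                   // placeholder output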

Can I limit the aggregation to only the current batch with watermarking and window logic when aggregating streaming data in Append output mode?

时间秒杀一切 submitted on 2019-12-08 13:42:59
Question: I am joining a streaming dataset on the LHS with a static dataset on the RHS. Since there can be multiple matches in the static dataset for a row of the LHS, the data explodes into duplicate rows for a single LHS id during the left_outer join; I want to group these rows, collecting the RHS matches into a list. Since it is guaranteed there will be no duplicates in the streaming data, I don't want to introduce a synthetic watermarking column and aggregate the data based on a time window around …
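One way to avoid a streaming aggregation entirely (and with it the watermark requirement of Append mode) is to pre-group the static side into one list per key before the join, so each streaming row produces exactly one output row that already carries the list of RHS matches. A sketch with invented sources and column names (id, lhs_value, rhs_value):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.collect_list

    val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()

    // Placeholder streaming source; assumed LHS columns are id and lhs_value.
    val streamDF = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")   // placeholder
      .option("subscribe", "lhs-topic")                 // placeholder
      .load()
      .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS lhs_value")

    // Placeholder static side; assumed columns are id and rhs_value.
    val staticDF = spark.read.parquet("/path/to/static")

    // Group the static side once, up front: one row per id with all RHS matches collected.
    val staticGrouped = staticDF
      .groupBy("id")
      .agg(collect_list("rhs_value").as("rhs_matches"))

    // The stream-static left_outer join now emits exactly one row per streaming row,
    // so no streaming aggregation (and therefore no watermark) is required in Append mode.
    val joined = streamDF.join(staticGrouped, Seq("id"), "left_outer")

    val query = joined.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()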

Select latest timestamp record after a window operation for every group in the data with Spark Scala

懵懂的女人 submitted on 2019-12-08 11:33:22
Question: I ran a count of attempts by (user, app) over a time window of one day (86400 seconds). I want to extract the rows with the latest timestamp along with the count and remove the unnecessary previous counts. Make sure your answer considers the time window: one user with one device can make multiple attempts in a day or a week, and I want to be able to retrieve those particular moments with the final count in every specific window. My initial dataset is like this: val df = sc.parallelize(Seq( ("user1", "iphone", "2017-12-22 10:06 …
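One pattern for keeping only the latest row per group and window is a row_number over a window partitioned by (user, device, day-window) and ordered by timestamp descending. A sketch with made-up sample rows and assumed column names (user, device, ts):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, count, lit, row_number, window}

    val spark = SparkSession.builder().appName("latest-per-window").getOrCreate()
    import spark.implicits._

    // Made-up sample rows; assumed columns are user, device and an event timestamp ts.
    val df = Seq(
      ("user1", "iphone", "2017-12-22 10:06:00"),
      ("user1", "iphone", "2017-12-22 11:04:00"),
      ("user1", "iphone", "2017-12-23 09:15:00")
    ).toDF("user", "device", "ts")
      .withColumn("ts", col("ts").cast("timestamp"))

    // Tumbling 1-day window plus a per-(user, device, day) attempt count.
    val byDay = Window.partitionBy("user", "device", "day")
    val counted = df
      .withColumn("day", window(col("ts"), "86400 seconds"))
      .withColumn("attempts", count(lit(1)).over(byDay))

    // Keep only the row with the latest timestamp inside each (user, device, day) group.
    val latest = counted
      .withColumn("rn", row_number().over(byDay.orderBy(col("ts").desc)))
      .filter(col("rn") === 1)
      .drop("rn")

    latest.show(truncate = false)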

How to lower the case of column names of a data frame but not its values?

只愿长相守 submitted on 2019-12-08 05:22:25
Question: How do I lowercase the column names of a DataFrame, but not its values, using raw Spark SQL and DataFrame methods? Input DataFrame (imagine I have hundreds of these columns in uppercase):
    NAME | COUNTRY | SRC        | CITY        | DEBIT
    ---------------------------------------------------
    "foo"| "NZ"    | salary     | "Auckland"  | 15.0
    "bar"| "Aus"   | investment | "Melbourne" | 12.5
Target DataFrame:
    name | country | src        | city        | debit
    ---------------------------------------------------
    "foo"| "NZ"    | salary     | "Auckland"  | 15 …
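On the DataFrame-method side, renaming every column in one call is usually enough; a minimal Scala sketch (the sample frame mirrors the one above):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("lowercase-columns").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("foo", "NZ", "salary", "Auckland", 15.0),
      ("bar", "Aus", "investment", "Melbourne", 12.5)
    ).toDF("NAME", "COUNTRY", "SRC", "CITY", "DEBIT")

    // toDF with a full list of new names rewrites every header without touching the data,
    // so it scales to hundreds of columns.
    val lowered = df.toDF(df.columns.map(_.toLowerCase): _*)
    lowered.show()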

How to save nested or JSON object in spark Dataset with converting to RDD?

本秂侑毒 submitted on 2019-12-08 04:42:55
Question: I am working on Spark code where I have to save multiple column values as one object and write the result to MongoDB. Given this Dataset:
    |---|-----|------|----------|
    |A  |A_SRC|Past_A|Past_A_SRC|
    |---|-----|------|----------|
    |a1 | s1  | a2   | s2       |
What I have tried:
    val ds1 = Seq(("1", "2", "3","4")).toDF("a", "src", "p_a","p_src")
    val recordCol = functions.to_json(Seq($"a", $"src", $"p_a",$"p_src"),struct("a", "src", "p_a","p_src")) as "A"
    ds1.select(recordCol).show(truncate = false)
gives …
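For what it's worth, to_json takes a single column, so the fields normally get packed into a struct first. A sketch of what I believe the intended call looks like, reusing the column names from the attempt above:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{struct, to_json}

    val spark = SparkSession.builder().appName("to-json").getOrCreate()
    import spark.implicits._

    val ds1 = Seq(("1", "2", "3", "4")).toDF("a", "src", "p_a", "p_src")

    // to_json accepts one Column, so the four fields are packed into a struct first.
    val recordCol = to_json(struct($"a", $"src", $"p_a", $"p_src")).as("A")

    ds1.select(recordCol).show(truncate = false)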

How to convert a JavaPairRDD to Dataset?

半城伤御伤魂 submitted on 2019-12-08 04:13:40
Question: SparkSession.createDataset() only accepts a List, RDD, or Seq, but it doesn't support JavaPairRDD. So if I have a JavaPairRDD<String, User> that I want to create a Dataset from, would a viable workaround for the SparkSession.createDataset() limitation be to create a wrapper UserMap class that contains two fields, String and User, and then do spark.createDataset(userMap, Encoders.bean(UserMap.class));? Answer 1: If you can convert the JavaPairRDD to a List<Tuple2<K, V>>, then you can use the createDataset method …