apache-spark-dataset

Apache Spark update a row in an RDD or Dataset based on another row

Submitted by 冷暖自知 on 2020-01-24 21:06:21
Question: I'm trying to figure out how I can update some rows based on another row. For example, I have some data like this:

Id | username | ratings | city
--------------------------------
1, philip, 2.0, montreal, ...
2, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...

I want to update the users in the same city to the same groupId (either 1 or 2):

Id | username | ratings | city
--------------------------------
1, philip, 2.0, montreal, ...
1, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...
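A minimal sketch of one way to express this, assuming the goal is to give every user in a city the smallest Id seen for that city (column names are taken from the example above; the window-based approach is one illustration, not the only option):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.min

    object SameGroupPerCity {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("SameGroupPerCity").master("local[*]").getOrCreate()
        import spark.implicits._

        val users = Seq(
          (1, "philip", 2.0, "montreal"),
          (2, "john", 4.0, "montreal"),
          (3, "charles", 2.0, "texas")
        ).toDF("Id", "username", "ratings", "city")

        // Every row in a city gets the smallest Id seen for that city.
        val grouped = users.withColumn("Id", min($"Id").over(Window.partitionBy("city")))

        grouped.show()
        spark.stop()
      }
    }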

How to convert the datasets of Spark Row into string?

Submitted by 空扰寡人 on 2020-01-22 09:46:49
Question: I have written code to access a Hive table using SparkSQL. Here is the code:

    SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark Hive Example")
        .master("local[*]")
        .config("hive.metastore.uris", "thrift://localhost:9083")
        .enableHiveSupport()
        .getOrCreate();

    Dataset<Row> df = spark.sql("select survey_response_value from health").toDF();
    df.show();

I would like to know how I can convert the complete output to a String or a String array, as I am trying to work with another …
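A minimal Scala sketch of the usual conversion (the question's code is Java, but equivalent calls exist in the Java API); it assumes the query result has a single string column and uses an in-memory stand-in for the Hive table:

    import org.apache.spark.sql.SparkSession

    object RowsToStrings {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("RowsToStrings").master("local[*]").getOrCreate()
        import spark.implicits._

        // Stand-in for the Hive query in the question; any single-column string DataFrame behaves the same.
        val df = Seq("yes", "no", "maybe").toDF("survey_response_value")

        // Map each Row to its single string column, then collect to the driver.
        val responses: Array[String] = df.as[String].collect()
        // Equivalent without the typed Dataset API:
        // val responses = df.collect().map(_.getString(0))

        responses.foreach(println)
        spark.stop()
      }
    }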

Spark Dataset unique id performance - row_number vs monotonically_increasing_id

Submitted by …衆ロ難τιáo~ on 2020-01-13 09:44:08
Question: I want to assign a unique Id to my dataset rows. I know that there are two implementation options.

First option:

    import org.apache.spark.sql.expressions.Window;
    ds.withColumn("id", row_number().over(Window.orderBy("a column")))

Second option:

    df.withColumn("id", monotonically_increasing_id())

The second option does not give sequential Ids, and that doesn't really matter. What I'm trying to figure out is whether there are any performance issues with these implementations, that is, whether one of these options is very slow …
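A small, self-contained sketch of both options side by side; the performance notes in the comments describe the general behaviour of unpartitioned windows and of monotonically_increasing_id, not measurements from this question:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

    object UniqueIdOptions {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("UniqueIdOptions").master("local[*]").getOrCreate()
        import spark.implicits._

        val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

        // Option 1: sequential ids, but an unpartitioned window moves all rows into a
        // single partition (Spark logs a warning about this), so it does not scale well.
        val withRowNumber = ds.withColumn("id", row_number().over(Window.orderBy("key")))

        // Option 2: ids are unique but not consecutive; they are computed per partition
        // with no shuffle, so the job stays fully distributed.
        val withMonotonicId = ds.withColumn("id", monotonically_increasing_id())

        withRowNumber.show()
        withMonotonicId.show()
        spark.stop()
      }
    }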

How to handle this in Spark

Submitted by 寵の児 on 2020-01-10 06:16:44
Question: I am using spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka. I have a scenario where some finance data comes from a Kafka topic. The data (base dataset) contains companyId, year and prev_year fields. If year === prev_year, then I need to join with a different table, i.e. exchange_rates. If year =!= prev_year, then I need to return the base dataset itself. How do I do this in spark-sql?

Answer 1: You can refer to the approach below for …
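One common way to express this kind of conditional enrichment is to split the dataset, join only the matching half, and union the two halves back together. A minimal sketch, with a hypothetical exchange_rates schema (the question does not show the real one):

    import org.apache.spark.sql.SparkSession

    object ConditionalJoin {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ConditionalJoin").master("local[*]").getOrCreate()
        import spark.implicits._

        val base = Seq((100, 2019, 2019), (200, 2019, 2018)).toDF("companyId", "year", "prev_year")

        // Hypothetical exchange_rates table, for illustration only.
        val exchangeRates = Seq((2019, "EUR", 1.13)).toDF("year", "currency", "rate")

        // Rows where year === prev_year get the exchange-rate columns ...
        val matched = base.filter($"year" === $"prev_year").join(exchangeRates, Seq("year"), "left")

        // ... rows where year =!= prev_year keep only the base data; joining against an
        // empty copy of exchange_rates pads them with nulls so the schemas line up for the union.
        val unmatched = base.filter($"year" =!= $"prev_year").join(exchangeRates.limit(0), Seq("year"), "left")

        matched.union(unmatched).show()
        spark.stop()
      }
    }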

How to pick latest record in spark structured streaming join

Submitted by 荒凉一梦 on 2020-01-09 11:58:09
Question: I am using spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka. I have currency rates metadata, a sample of which is below:

    val ratesMetaDataDf = Seq(
      ("EUR", "5/10/2019", "1.130657", "USD"),
      ("EUR", "5/9/2019", "1.13088", "USD")
    ).toDF("base_code", "rate_date", "rate_value", "target_code")
      .withColumn("rate_date", to_date($"rate_date", "MM/dd/yyyy").cast(DateType))
      .withColumn("rate_value", $"rate_value".cast(DoubleType))

The sales records which I received …
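A minimal batch-style sketch of one way to keep only the newest rate per currency pair before joining; the sales DataFrame below is hypothetical (the question's sales sample is truncated), and in an actual structured-streaming job this reduction would be done on the static rates side, since window functions are not supported on streaming DataFrames:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number, to_date}
    import org.apache.spark.sql.types.{DateType, DoubleType}

    object LatestRate {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("LatestRate").master("local[*]").getOrCreate()
        import spark.implicits._

        val ratesMetaDataDf = Seq(
          ("EUR", "5/10/2019", "1.130657", "USD"),
          ("EUR", "5/9/2019", "1.13088", "USD")
        ).toDF("base_code", "rate_date", "rate_value", "target_code")
          .withColumn("rate_date", to_date($"rate_date", "MM/dd/yyyy").cast(DateType))
          .withColumn("rate_value", $"rate_value".cast(DoubleType))

        // Keep only the newest rate per (base_code, target_code) pair.
        val byPair = Window.partitionBy("base_code", "target_code").orderBy(col("rate_date").desc)
        val latestRates = ratesMetaDataDf
          .withColumn("rn", row_number().over(byPair))
          .filter($"rn" === 1)
          .drop("rn")

        // Hypothetical sales records, for illustration only.
        val salesDf = Seq(("EUR", "USD", 100.0)).toDF("sale_base_code", "sale_target_code", "amount")

        salesDf.join(latestRates,
          $"sale_base_code" === $"base_code" && $"sale_target_code" === $"target_code",
          "left").show()

        spark.stop()
      }
    }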

Why is “Unable to find encoder for type stored in a Dataset” when creating a dataset of custom case class?

Submitted by 断了今生、忘了曾经 on 2020-01-08 12:23:52
Question: Spark 2.0 (final) with Scala 2.11.8. The following super simple code yields the compilation error:

    Error:(17, 45) Unable to find encoder for type stored in a Dataset.
    Primitive types (Int, String, etc) and Product types (case classes) are supported
    by importing spark.implicits._  Support for serializing other types will be added
    in future releases.

The code:

    import org.apache.spark.sql.SparkSession

    case class SimpleTuple(id: Int, desc: String)

    object DatasetTest {
      val dataList = List(
        SimpleTuple(5, "abc …
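The usual fix reported for this error is to make an Encoder for the case class available where the Dataset is built, by importing the session's implicits (or by supplying Encoders.product[SimpleTuple] explicitly). A minimal sketch; the list values after the truncation point are invented for illustration:

    import org.apache.spark.sql.SparkSession

    // Case classes used in Datasets are declared at the top level,
    // outside the object/class that builds the Dataset.
    case class SimpleTuple(id: Int, desc: String)

    object DatasetTest {
      val dataList = List(
        SimpleTuple(5, "abc"),   // values beyond "abc" are hypothetical
        SimpleTuple(6, "bcd")
      )

      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession.builder().master("local[*]").appName("DatasetTest").getOrCreate()

        // Bringing the implicits into scope supplies the Encoder[SimpleTuple]
        // that the compiler error is about.
        import sparkSession.implicits._

        val dataset = sparkSession.createDataset(dataList)
        dataset.show()
        sparkSession.stop()
      }
    }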

Iterate through a column in a Dataset which has an array of key-value pairs and find the pair with the max value

Submitted by 爱⌒轻易说出口 on 2020-01-04 09:39:14
Question: I have data in a dataframe, which was obtained from Azure Event Hubs. I then convert this data to a JSON object and store the required data into a dataset, as shown below. Code for obtaining data from Event Hubs and storing it in a dataframe:

    val connectionString = ConnectionStringBuilder(<ENDPOINT URL>)
      .setEventHubName(<EVENTHUB NAME>).build
    val currTime = Instant.now
    val ehConf = EventHubsConf(connectionString)
      .setConsumerGroup("<CONSUMER GRP>")
      .setStartingPosition(EventPosition …
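Independent of how the data arrives from Event Hubs, the "find the pair with the max value per row" part of the title can be sketched with explode plus a window, shown below on a hypothetical stand-in DataFrame (the real schema is not visible in the excerpt):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, explode, row_number}

    object MaxPairPerRow {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("MaxPairPerRow").master("local[*]").getOrCreate()
        import spark.implicits._

        // Hypothetical stand-in: one row per message, each with an array of (key, value) pairs.
        val df = Seq(
          ("msg1", Seq(("a", 1.0), ("b", 3.5), ("c", 2.0))),
          ("msg2", Seq(("x", 9.0), ("y", 4.0)))
        ).toDF("id", "pairs")

        // Explode the array so each pair becomes its own row, then keep the
        // pair with the largest value for each original row.
        val exploded = df.select($"id", explode($"pairs").as("pair"))
        val byId = Window.partitionBy("id").orderBy(col("pair._2").desc)
        val maxPairs = exploded
          .withColumn("rn", row_number().over(byId))
          .filter($"rn" === 1)
          .select($"id", $"pair._1".as("key"), $"pair._2".as("value"))

        maxPairs.show()
        spark.stop()
      }
    }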

Do I have to explicitly use Dataframe's methods to take advantage of Dataset's optimization? [duplicate]

Submitted by ぐ巨炮叔叔 on 2020-01-03 06:32:46
Question: This question already has answers here: Spark 2.0 Dataset vs DataFrame (3 answers). Closed 2 years ago. To take advantage of Dataset's optimization, do I have to explicitly use Dataframe's methods (e.g. df.select(col("name"), col("age")), etc.), or would calling any of Dataset's methods, even the RDD-like ones (e.g. filter, map, etc.), also allow for optimization?

Answer 1: Dataframe optimization comes in general in 3 flavors: Tungsten memory management, Catalyst query optimization, and whole-stage codegen …
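A small illustrative sketch (not from the original answer) of what Catalyst can and cannot see: untyped column expressions are visible to the optimizer, while the body of a typed lambda is opaque. Comparing the two explain() outputs shows the difference in the physical plans:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    case class Person(name: String, age: Int, city: String)

    object OptimizationVisibility {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("OptimizationVisibility").master("local[*]").getOrCreate()
        import spark.implicits._

        val ds = Seq(Person("philip", 30, "montreal"), Person("john", 40, "texas")).toDS()

        // Untyped column expressions: Catalyst sees the predicate and projection,
        // so it can push filters down and prune unused columns.
        ds.filter(col("age") > 35).select(col("name")).explain()

        // Typed lambda: the function body is a black box to Catalyst, so the plan
        // deserializes full Person objects before applying it.
        ds.filter(p => p.age > 35).map(_.name).explain()

        spark.stop()
      }
    }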