apache-spark-dataset

Apache Spark update a row in an RDD or Dataset based on another row

Submitted by 冷暖自知 on 2020-01-24 21:06:21
Question: I'm trying to figure out how I can update some rows based on another row. For example, I have some data like this:

Id | username | ratings | city
--------------------------------
1, philip, 2.0, montreal, ...
2, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...

I want to update the users in the same city to the same groupId (either 1 or 2):

Id | username | ratings | city
--------------------------------
1, philip, 2.0, montreal, ...
1, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...
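A minimal sketch of one way to express this, assuming the goal is to give every user in a city the smallest Id seen for that city (column names are taken from the example above; the window-based approach is one illustration, not the only option):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.min

    object SameGroupPerCity {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("SameGroupPerCity").master("local[*]").getOrCreate()
        import spark.implicits._

        val users = Seq(
          (1, "philip", 2.0, "montreal"),
          (2, "john", 4.0, "montreal"),
          (3, "charles", 2.0, "texas")
        ).toDF("Id", "username", "ratings", "city")

        // Every row in a city gets the smallest Id seen for that city.
        val grouped = users.withColumn("Id", min($"Id").over(Window.partitionBy("city")))

        grouped.show()
        spark.stop()
      }
    }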

How to convert the datasets of Spark Row into string?

Submitted by 空扰寡人 on 2020-01-22 09:46:49
Question: I have written code to access a Hive table using SparkSQL. Here is the code:

    SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark Hive Example")
        .master("local[*]")
        .config("hive.metastore.uris", "thrift://localhost:9083")
        .enableHiveSupport()
        .getOrCreate();

    Dataset<Row> df = spark.sql("select survey_response_value from health").toDF();
    df.show();

I would like to know how I can convert the complete output to a String or a String array, as I am trying to work with another …
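A minimal Scala sketch of the usual conversion (the question's code is Java, but equivalent calls exist in the Java API); it assumes the query result has a single string column and uses an in-memory stand-in for the Hive table:

    import org.apache.spark.sql.SparkSession

    object RowsToStrings {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("RowsToStrings").master("local[*]").getOrCreate()
        import spark.implicits._

        // Stand-in for the Hive query in the question; any single-column string DataFrame behaves the same.
        val df = Seq("yes", "no", "maybe").toDF("survey_response_value")

        // Map each Row to its single string column, then collect to the driver.
        val responses: Array[String] = df.as[String].collect()
        // Equivalent without the typed Dataset API:
        // val responses = df.collect().map(_.getString(0))

        responses.foreach(println)
        spark.stop()
      }
    }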

Spark Dataset unique id performance - row_number vs monotonically_increasing_id

Submitted by …衆ロ難τιáo~ on 2020-01-13 09:44:08
Question: I want to assign a unique Id to my dataset rows. I know that there are two implementation options.

First option:

    import org.apache.spark.sql.expressions.Window;
    ds.withColumn("id", row_number().over(Window.orderBy("a column")))

Second option:

    df.withColumn("id", monotonically_increasing_id())

The second option does not give sequential Ids, and that doesn't really matter. What I'm trying to figure out is whether there are any performance issues with these implementations, that is, whether one of these options is very slow …
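A small, self-contained sketch of both options side by side; the performance notes in the comments describe the general behaviour of unpartitioned windows and of monotonically_increasing_id, not measurements from this question:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

    object UniqueIdOptions {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("UniqueIdOptions").master("local[*]").getOrCreate()
        import spark.implicits._

        val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

        // Option 1: sequential ids, but an unpartitioned window moves all rows into a
        // single partition (Spark logs a warning about this), so it does not scale well.
        val withRowNumber = ds.withColumn("id", row_number().over(Window.orderBy("key")))

        // Option 2: ids are unique but not consecutive; they are computed per partition
        // with no shuffle, so the job stays fully distributed.
        val withMonotonicId = ds.withColumn("id", monotonically_increasing_id())

        withRowNumber.show()
        withMonotonicId.show()
        spark.stop()
      }
    }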

How to handle this in Spark

Submitted by 寵の児 on 2020-01-10 06:16:44
Question: I am using spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka. I have a scenario where some finance data comes from a Kafka topic. The data (base dataset) contains companyId, year and prev_year fields. If year === prev_year, then I need to join with a different table, i.e. exchange_rates. If year =!= prev_year, then I need to return the base dataset itself. How do I do this in spark-sql?

Answer 1: You can refer to the approach below for …
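One common way to express this kind of conditional enrichment is to split the dataset, join only the matching half, and union the two halves back together. A minimal sketch, with a hypothetical exchange_rates schema (the question does not show the real one):

    import org.apache.spark.sql.SparkSession

    object ConditionalJoin {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ConditionalJoin").master("local[*]").getOrCreate()
        import spark.implicits._

        val base = Seq((100, 2019, 2019), (200, 2019, 2018)).toDF("companyId", "year", "prev_year")

        // Hypothetical exchange_rates table, for illustration only.
        val exchangeRates = Seq((2019, "EUR", 1.13)).toDF("year", "currency", "rate")

        // Rows where year === prev_year get the exchange-rate columns ...
        val matched = base.filter($"year" === $"prev_year").join(exchangeRates, Seq("year"), "left")

        // ... rows where year =!= prev_year keep only the base data; joining against an
        // empty copy of exchange_rates pads them with nulls so the schemas line up for the union.
        val unmatched = base.filter($"year" =!= $"prev_year").join(exchangeRates.limit(0), Seq("year"), "left")

        matched.union(unmatched).show()
        spark.stop()
      }
    }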

How to pick latest record in spark structured streaming join

Submitted by 荒凉一梦 on 2020-01-09 11:58:09
Question: I am using spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka. I have currency rates metadata, a sample of which is below:

    val ratesMetaDataDf = Seq(
      ("EUR", "5/10/2019", "1.130657", "USD"),
      ("EUR", "5/9/2019", "1.13088", "USD")
    ).toDF("base_code", "rate_date", "rate_value", "target_code")
      .withColumn("rate_date", to_date($"rate_date", "MM/dd/yyyy").cast(DateType))
      .withColumn("rate_value", $"rate_value".cast(DoubleType))

The sales records which I received …
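A minimal batch-style sketch of one way to keep only the newest rate per currency pair before joining; the sales DataFrame below is hypothetical (the question's sales sample is truncated), and in an actual structured-streaming job this reduction would be done on the static rates side, since window functions are not supported on streaming DataFrames:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number, to_date}
    import org.apache.spark.sql.types.{DateType, DoubleType}

    object LatestRate {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("LatestRate").master("local[*]").getOrCreate()
        import spark.implicits._

        val ratesMetaDataDf = Seq(
          ("EUR", "5/10/2019", "1.130657", "USD"),
          ("EUR", "5/9/2019", "1.13088", "USD")
        ).toDF("base_code", "rate_date", "rate_value", "target_code")
          .withColumn("rate_date", to_date($"rate_date", "MM/dd/yyyy").cast(DateType))
          .withColumn("rate_value", $"rate_value".cast(DoubleType))

        // Keep only the newest rate per (base_code, target_code) pair.
        val byPair = Window.partitionBy("base_code", "target_code").orderBy(col("rate_date").desc)
        val latestRates = ratesMetaDataDf
          .withColumn("rn", row_number().over(byPair))
          .filter($"rn" === 1)
          .drop("rn")

        // Hypothetical sales records, for illustration only.
        val salesDf = Seq(("EUR", "USD", 100.0)).toDF("sale_base_code", "sale_target_code", "amount")

        salesDf.join(latestRates,
          $"sale_base_code" === $"base_code" && $"sale_target_code" === $"target_code",
          "left").show()

        spark.stop()
      }
    }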

Why is “Unable to find encoder for type stored in a Dataset” when creating a dataset of custom case class?

Submitted by 断了今生、忘了曾经 on 2020-01-08 12:23:52
Question: Spark 2.0 (final) with Scala 2.11.8. The following super simple code yields the compilation error:

    Error:(17, 45) Unable to find encoder for type stored in a Dataset.
    Primitive types (Int, String, etc) and Product types (case classes) are supported
    by importing spark.implicits._  Support for serializing other types will be added
    in future releases.

The code:

    import org.apache.spark.sql.SparkSession

    case class SimpleTuple(id: Int, desc: String)

    object DatasetTest {
      val dataList = List(
        SimpleTuple(5, "abc …
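The usual fix reported for this error is to make an Encoder for the case class available where the Dataset is built, by importing the session's implicits (or by supplying Encoders.product[SimpleTuple] explicitly). A minimal sketch; the list values after the truncation point are invented for illustration:

    import org.apache.spark.sql.SparkSession

    // Case classes used in Datasets are declared at the top level,
    // outside the object/class that builds the Dataset.
    case class SimpleTuple(id: Int, desc: String)

    object DatasetTest {
      val dataList = List(
        SimpleTuple(5, "abc"),   // values beyond "abc" are hypothetical
        SimpleTuple(6, "bcd")
      )

      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession.builder().master("local[*]").appName("DatasetTest").getOrCreate()

        // Bringing the implicits into scope supplies the Encoder[SimpleTuple]
        // that the compiler error is about.
        import sparkSession.implicits._

        val dataset = sparkSession.createDataset(dataList)
        dataset.show()
        sparkSession.stop()
      }
    }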

Iterate through a column in a Dataset which has an array of key-value pairs and find the pair with the max value

Submitted by 爱⌒轻易说出口 on 2020-01-04 09:39:14
Question: I have data in a dataframe, which was obtained from Azure Event Hubs. I then convert this data to a JSON object and store the required data into a dataset, as shown below. Code for obtaining data from Event Hubs and storing it in a dataframe:

    val connectionString = ConnectionStringBuilder(<ENDPOINT URL>)
      .setEventHubName(<EVENTHUB NAME>).build
    val currTime = Instant.now
    val ehConf = EventHubsConf(connectionString)
      .setConsumerGroup("<CONSUMER GRP>")
      .setStartingPosition(EventPosition …
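Independent of how the data arrives from Event Hubs, the "find the pair with the max value per row" part of the title can be sketched with explode plus a window, shown below on a hypothetical stand-in DataFrame (the real schema is not visible in the excerpt):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, explode, row_number}

    object MaxPairPerRow {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("MaxPairPerRow").master("local[*]").getOrCreate()
        import spark.implicits._

        // Hypothetical stand-in: one row per message, each with an array of (key, value) pairs.
        val df = Seq(
          ("msg1", Seq(("a", 1.0), ("b", 3.5), ("c", 2.0))),
          ("msg2", Seq(("x", 9.0), ("y", 4.0)))
        ).toDF("id", "pairs")

        // Explode the array so each pair becomes its own row, then keep the
        // pair with the largest value for each original row.
        val exploded = df.select($"id", explode($"pairs").as("pair"))
        val byId = Window.partitionBy("id").orderBy(col("pair._2").desc)
        val maxPairs = exploded
          .withColumn("rn", row_number().over(byId))
          .filter($"rn" === 1)
          .select($"id", $"pair._1".as("key"), $"pair._2".as("value"))

        maxPairs.show()
        spark.stop()
      }
    }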

Do I have to explicitly use Dataframe's methods to take advantage of Dataset's optimization? [duplicate]

Submitted by ぐ巨炮叔叔 on 2020-01-03 06:32:46
Question: This question already has answers here: Spark 2.0 Dataset vs DataFrame (3 answers). Closed 2 years ago. To take advantage of Dataset's optimization, do I have to explicitly use Dataframe's methods (e.g. df.select(col("name"), col("age")), etc.), or would calling any of Dataset's methods, even the RDD-like ones (e.g. filter, map, etc.), also allow for optimization?

Answer 1: Dataframe optimization comes in general in 3 flavors: Tungsten memory management, Catalyst query optimization, and whole-stage codegen …
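A small illustrative sketch (not from the original answer) of what Catalyst can and cannot see: untyped column expressions are visible to the optimizer, while the body of a typed lambda is opaque. Comparing the two explain() outputs shows the difference in the physical plans:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    case class Person(name: String, age: Int, city: String)

    object OptimizationVisibility {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("OptimizationVisibility").master("local[*]").getOrCreate()
        import spark.implicits._

        val ds = Seq(Person("philip", 30, "montreal"), Person("john", 40, "texas")).toDS()

        // Untyped column expressions: Catalyst sees the predicate and projection,
        // so it can push filters down and prune unused columns.
        ds.filter(col("age") > 35).select(col("name")).explain()

        // Typed lambda: the function body is a black box to Catalyst, so the plan
        // deserializes full Person objects before applying it.
        ds.filter(p => p.age > 35).map(_.name).explain()

        spark.stop()
      }
    }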