apache-spark-dataset

Load CSV data into DataFrame and convert to Array using Apache Spark (Java)

Submitted by 百般思念 on 2019-12-02 01:17:41
I have a CSV file with the below data: 1,2,5 2,4 2,3 I want to load them into a Dataframe having a schema of array of string. The output should be like below. [1, 2, 5] [2, 4] [2, 3] This has been answered using Scala here: Spark: Convert column of string to an array. I want to make it happen in Java. Please help. Below is the sample code in Java. You need to read your file using the spark.read().text(String path) method and then call the split function. import static org.apache.spark.sql.functions.split; public class SparkSample { public static void main(String[] args) { SparkSession spark = SparkSession
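
A minimal sketch of the same split-based approach, shown in Scala (the truncated Java sample above uses the equivalent spark.read().text(...) and functions.split calls); the input path and object name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

object SplitCsvLines {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SplitCsvLines").master("local[*]").getOrCreate()
    import spark.implicits._

    // Read each line of the file as a single string column named "value",
    // then split on commas to obtain an array<string> column.
    val lines  = spark.read.text("/path/to/data.csv")           // hypothetical path
    val arrays = lines.select(split($"value", ",").as("values"))

    arrays.show(false)   // [1, 2, 5] / [2, 4] / [2, 3]

    // To materialize plain arrays on the driver:
    val collected: Array[Seq[String]] = arrays.as[Seq[String]].collect()
    spark.stop()
  }
}
```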

When to use Spark DataFrame/Dataset API and when to use plain RDD?

Submitted by 天涯浪子 on 2019-12-01 16:42:31
The Spark SQL DataFrame/Dataset execution engine has several extremely efficient time & space optimizations (e.g. InternalRow & expression codeGen). According to much of the documentation, it seems to be a better option than RDD for most distributed algorithms. However, I did some source code research and am still not convinced. I have no doubt that InternalRow is much more compact and can save a large amount of memory. But execution of algorithms may not be any faster, save for predefined expressions. Namely, it is indicated in the source code of org.apache.spark.sql.catalyst.expressions.ScalaUDF that every
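
As a hedged illustration of the trade-off being discussed (not taken from the question), the sketch below contrasts a built-in Column expression, which stays inside Catalyst codegen over InternalRow, with a Scala UDF, which forces per-row conversion to JVM objects, the cost hinted at in ScalaUDF's source. Object and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfVsExpression {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("UdfVsExpression").master("local[*]").getOrCreate()

    val df = spark.range(0, 1000000).toDF("id")

    // Built-in expression: stays inside Catalyst and benefits from whole-stage codegen.
    val viaExpression = df.select((col("id") * 2).as("doubled"))

    // Scala UDF: Catalyst must convert each InternalRow value to a JVM Long
    // (and the result back), which is the per-row overhead discussed in ScalaUDF.
    val double = udf((x: Long) => x * 2)
    val viaUdf = df.select(double(col("id")).as("doubled"))

    // Compare the physical plans: the UDF version contains an opaque ScalaUDF node.
    viaExpression.explain()
    viaUdf.explain()
    spark.stop()
  }
}
```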

Mapping json to case class with Spark (spaces in the field name)

Submitted by ∥☆過路亽.° on 2019-12-01 11:28:48
I am trying to read a JSON file with the Spark Dataset API; the problem is that this JSON contains spaces in some of the field names. This would be a JSON row: {"Field Name" : "value"} My case class needs to be like this: case class MyType(`Field Name`: String) Then I can load the file into a DataFrame and it will load the correct schema: val dataframe = spark.read.json(path) The problem comes when I try to convert the DataFrame to a Dataset[MyType] with dataframe.as[MyType] The StructSchema loaded by the Encoder[MyType] is wrong and it introduces $u0020 instead of the space, and I get the following
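
One common workaround, sketched below under the assumption that renaming the column is acceptable (the case class field name and path are hypothetical): rename the column containing the space before calling .as, so the Encoder-derived schema and the DataFrame schema agree.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical target type: a field name without spaces avoids the $u0020 encoding issue.
case class MyType(fieldName: String)

object JsonSpacesInNames {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JsonSpacesInNames").master("local[*]").getOrCreate()
    import spark.implicits._

    // Input rows look like: {"Field Name" : "value"}
    val dataframe = spark.read.json("/path/to/rows.json")   // hypothetical path

    // Rename the offending column so it matches the case class field.
    val dataset = dataframe
      .withColumnRenamed("Field Name", "fieldName")
      .as[MyType]

    dataset.show()
    spark.stop()
  }
}
```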

Dataframe to Dataset which has type Any

Submitted by 跟風遠走 on 2019-12-01 06:55:20
I recently moved from Spark 1.6 to Spark 2.X and I would like to move - where possible - from Dataframes to Datasets as well. I tried code like this: case class MyClass(a : Any, ...) val df = ... df.map(x => MyClass(x.get(0), ...)) As you can see, MyClass has a field of type Any, as I do not know at compile time the type of the field I retrieve with x.get(0). It may be a long, string, int, etc. However, when I try to execute code similar to what you see above, I get an exception: java.lang.ClassNotFoundException: scala.Any With some debugging, I realized that the exception is raised, not
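
Spark cannot derive an Encoder for scala.Any, so one hedged workaround is to pick a concrete representation for the unknown column, e.g. String. A minimal sketch with a hypothetical case class and sample data:

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical replacement for MyClass: the unknown-typed field is carried as a String,
// which Spark can encode (unlike scala.Any).
case class MyClassStr(a: String)

object AnyToConcrete {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AnyToConcrete").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1L, "x"), (2L, "y")).toDF("a", "b")

    // Row.get(0) returns Any; converting it to String gives a type with an Encoder.
    val ds = df.map((row: Row) => MyClassStr(String.valueOf(row.get(0))))
    ds.show()
    spark.stop()
  }
}
```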

S3 SlowDown error in Spark on EMR

Submitted by Deadly on 2019-12-01 02:24:51
I am getting this error when writing a parquet file; this has started happening recently: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: 2CA496E2AB87DC16), S3 Extended Request ID: 1dBrcqVGJU9VgoT79NAVGyN0fsbj9+6bipC7op97ZmP+zSFIuH72lN03ZtYabNIA2KaSj18a8ho= at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1389) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient
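
The SlowDown (503) response means S3 is throttling the request rate to that bucket/prefix, so one common mitigation is simply to issue fewer concurrent PUTs when writing. A rough sketch, with hypothetical bucket paths and a hypothetical target file count (raising the EMRFS retry/backoff settings is the other usual lever):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object ParquetWriteThrottled {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ParquetWriteThrottled").getOrCreate()

    val df = spark.read.parquet("s3://my-bucket/input/")   // hypothetical input path

    // Writing fewer, larger files reduces the number of simultaneous S3 requests,
    // which is what the SlowDown error is asking for.
    df.coalesce(64)                                         // hypothetical file count
      .write
      .mode(SaveMode.Overwrite)
      .parquet("s3://my-bucket/output/")                    // hypothetical output path

    spark.stop()
  }
}
```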

Spark / Scala: fill nan with last good observation

Submitted by 帅比萌擦擦* on 2019-12-01 01:20:18
I am using Spark 2.0.1 and want to fill NaN values with the last known good value in the column. The only references for Spark I could find are Spark / Scala: forward fill with last observation or Fill in null with previously known good value with pyspark, which seem to use RDDs. I would rather stay in the DataFrame / Dataset world and possibly handle multiple NaN values. Is this possible? My assumption is that the data (initially loaded from e.g. a CSV file) is ordered by time and that this order is preserved in the distributed setting, e.g. filling by the close / last known good value is correct.
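
A forward fill can be expressed in the DataFrame API with a window and last(..., ignoreNulls = true); below is a minimal sketch with hypothetical data. In practice the window would also be partitioned by some key so that ordering does not pull everything onto one partition.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}

object ForwardFill {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ForwardFill").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical data: a time column and a value column with gaps (missing values loaded as null).
    val df = Seq(
      (1, Some(10.0)), (2, None), (3, None), (4, Some(40.0)), (5, None)
    ).toDF("time", "value")

    // For each row, take the last non-null value seen up to and including this row.
    // (Long.MinValue / 0 is the pre-2.1 spelling of unboundedPreceding / currentRow.)
    val w = Window.orderBy("time").rowsBetween(Long.MinValue, 0)
    val filled = df.withColumn("value_filled", last(col("value"), ignoreNulls = true).over(w))

    filled.show()
    spark.stop()
  }
}
```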

How to use approxQuantile by group?

Submitted by 混江龙づ霸主 on 2019-12-01 00:12:11
Spark has the SQL function percentile_approx(), and its Scala counterpart is df.stat.approxQuantile(). However, the Scala counterpart cannot be used on grouped datasets, something like df.groupBy("foo").stat.approxQuantile(), as answered here: https://stackoverflow.com/a/51933027 . But it is possible to do both grouping and percentiles in SQL syntax. So I'm wondering, maybe I can define a UDF from the SQL percentile_approx function and use it on my grouped dataset? While you cannot use approxQuantile in a UDF, and there is no Scala wrapper for percentile_approx, it is not hard to implement one
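
A hedged sketch of the SQL-function route mentioned in the question: call percentile_approx through expr() inside a grouped aggregation, so no UDF or Scala wrapper is needed. Column names and sample data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

object GroupedQuantiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GroupedQuantiles").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical data: group key "foo" and a numeric column "bar".
    val df = Seq(("a", 1.0), ("a", 2.0), ("a", 9.0), ("b", 4.0), ("b", 5.0)).toDF("foo", "bar")

    // percentile_approx is exposed through SQL, so invoke it via expr() per group.
    val quantiles = df
      .groupBy("foo")
      .agg(expr("percentile_approx(bar, 0.5)").as("approx_median"))

    quantiles.show()
    spark.stop()
  }
}
```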

How to split multi-value column into separate rows using typed Dataset?

Submitted by 荒凉一梦 on 2019-11-30 20:08:08
I am facing an issue of how to split a multi-value column, i.e. List[String], into separate rows. The initial dataset has the following types: Dataset[(Integer, String, Double, scala.List[String])]

+---+--------------------+-------+--------------------+
| id| text               | value | properties         |
+---+--------------------+-------+--------------------+
|  0|Lorem ipsum dolor...|    1.0|[prp1, prp2, prp3..]|
|  1|Lorem ipsum dolor...|    2.0|[prp4, prp5, prp6..]|
|  2|Lorem ipsum dolor...|    3.0|[prp7, prp8, prp9..]|

The resulting dataset should have the following types: Dataset[(Integer, String, Double, String)] and the
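
A sketch that stays in the typed Dataset API, using hypothetical sample rows (and Int in place of Integer): flatMap emits one output tuple per element of the List[String] column, turning Dataset[(Int, String, Double, List[String])] into Dataset[(Int, String, Double, String)].

```scala
import org.apache.spark.sql.SparkSession

object ExplodeTyped {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ExplodeTyped").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical input shaped like the question's dataset.
    val ds = Seq(
      (0, "Lorem ipsum dolor...", 1.0, List("prp1", "prp2", "prp3")),
      (1, "Lorem ipsum dolor...", 2.0, List("prp4", "prp5", "prp6"))
    ).toDS()

    // One output row per element of the properties list; the other columns are repeated.
    val exploded = ds.flatMap { row =>
      val (id, text, value, properties) = row
      properties.map(p => (id, text, value, p))
    }

    exploded.show(false)
    spark.stop()
  }
}
```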
