apache-spark

Spark error with google/guava library: java.lang.NoSuchMethodError: com.google.common.cache.CacheBuilder.refreshAfterWrite

Submitted by 女生的网名这么多〃 on 2021-02-08 03:08:26
Question: I have a simple Spark project in which the pom.xml dependencies are only the basic scala, scalatest/junit, and spark:

    <dependency>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>3.2.0</version>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-compiler</artifactId>
      <version

spark dataframe: explode list column

Submitted by 梦想的初衷 on 2021-02-07 21:32:38
Question: I've got an output from a Spark Aggregator which is a List[Character]:

    case class Character(name: String, secondName: String, faculty: String)
    val charColumn = HPAggregator.toColumn
    val resultDF = someDF.select(charColumn)

So my dataframe looks like:

    +-----------------------------------------------+
    | value                                         |
    +-----------------------------------------------+
    |[[harry, potter, gryffindor],[ron, weasley ... |
    +-----------------------------------------------+

Now I want to convert it to +------------
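The question is cut off above, but here is a minimal sketch of the usual approach, assuming the value column holds an array of Character structs (only resultDF and the field names come from the question; everything else is illustrative):

```scala
import org.apache.spark.sql.functions.{col, explode}

// Explode the array column so each Character struct becomes its own row,
// then flatten the struct fields into top-level columns.
val exploded = resultDF
  .select(explode(col("value")).as("character"))
  .select("character.name", "character.secondName", "character.faculty")
```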

Read a binary column in spark using java language

Submitted by 六月ゝ 毕业季﹏ on 2021-02-07 21:00:34
Question: I have a DataFrame which contains a BinaryType column. DataFrame:

    +--------------------------------------------------------------------------------------------------------------------------------

java.lang.IllegalArgumentException when applying a Python UDF to a Spark dataframe

Submitted by 时光毁灭记忆、已成空白 on 2021-02-07 20:39:38
Question: I'm testing the example code provided in the documentation of pandas_udf (https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf), using PySpark 2.3.1 on my local machine:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
        ("id", "v"))

    @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
    def normalize(pdf):
        v = pdf.v

Could anyone please explain what c000 means in c000.snappy.parquet or c000.snappy.orc?

Submitted by 时间秒杀一切 on 2021-02-07 20:30:26
Question: I have searched through all the documentation and still haven't found why there is a prefix, or what c000 is, in the file naming convention below:

    file:/Users/stephen/p/spark/f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet

Answer 1: You should follow the "Talk is cheap, show me the code" approach. Not everything is documented, and one way to find out is to read the code. Consider part-1-2_3-4.parquet:

- Split/partition number.
- Random UUID to prevent collisions between different (appending)
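The answer is cut off above; as a hedged illustration (the path and data below are made up, not from the original post), writing any DataFrame shows the full pattern part-<split>-<uuid>-c<counter>.<codec>.<format>, where cNNN appears to be a per-task output-file counter:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("file-naming-demo")
  .getOrCreate()
import spark.implicits._

// Two partitions -> part-00000 and part-00001; snappy is the default parquet codec,
// so the files come out as e.g. part-00000-<uuid>-c000.snappy.parquet.
Seq((1, "a"), (2, "b")).toDF("id", "value")
  .repartition(2)
  .write.parquet("/tmp/file-naming-demo")
```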

How to execute async operations (i.e. returning a Future) from map/filter/etc.?

Submitted by 十年热恋 on 2021-02-07 20:20:23
Question: I have a DataSet.map operation that needs to pull data in from an external REST API. The REST API client returns a Future[Int]. Is it possible to have the DataSet.map operation somehow await this Future asynchronously? Or will I need to block the thread using Await.result? Or is this just not the done thing... i.e. should I instead try to load the data held by the API into a DataSet of its own and perform a join? Thanks in advance!

EDIT: Different from: Spark job with Async HTTP call
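No answer is included above; one common pattern (a sketch only, assuming a hypothetical client method fetchScore returning Future[Int] and a Dataset[Long] called ds) is to batch the futures inside mapPartitions, so each task blocks once per partition rather than once per element:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical REST call; the real client and its method name will differ.
def fetchScore(id: Long): Future[Int] = Future.successful(42)

// ds is assumed to be a Dataset[Long]; spark.implicits._ must be in scope
// so an Encoder[Int] is available for the result.
val enriched = ds.mapPartitions { ids =>
  // Start every request for the partition first, then block once for the whole batch.
  val futures = ids.map(fetchScore).toVector
  Await.result(Future.sequence(futures), 5.minutes).iterator
}
```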

Hive/Impala performance with string partition key vs Integer partition key

Submitted by 梦想的初衷 on 2021-02-07 19:54:36
Question: Are numeric columns recommended for partition keys? Will there be any performance difference when we run a select query against numeric-column partitions vs string-column partitions?

Answer 1: No, there is no such recommendation. Consider this: a partition in Hive is represented as a folder with a name like 'key=value' (or sometimes just 'value'), but either way it is a string folder name. So the value is stored as a string and is cast during read/write. The partition key value is not packed
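As a minimal sketch of the same point from Spark (paths and column names are made up, and this is an illustration rather than part of the original answer; an active SparkSession named spark is assumed): even when the partition column is an integer, the on-disk partition directories are plain 'key=value' strings:

```scala
import spark.implicits._

// year is an Int column, but the partition directories are string folder names.
Seq((1, 2020), (2, 2021)).toDF("id", "year")
  .write
  .partitionBy("year")
  .parquet("/tmp/partition-naming-demo")

// Resulting layout:
//   /tmp/partition-naming-demo/year=2020/part-...-c000.snappy.parquet
//   /tmp/partition-naming-demo/year=2021/part-...-c000.snappy.parquet
```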