apache-spark

Spark error with google/guava library: java.lang.NoSuchMethodError: com.google.common.cache.CacheBuilder.refreshAfterWrite

Submitted by 女生的网名这么多〃 on 2021-02-08 03:08:26
Question: I have a simple Spark project in which the pom.xml dependencies are only the basic scala, scalatest/junit, and spark:

    <dependency>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>3.2.0</version>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-compiler</artifactId>
      <version

spark dataframe: explode list column

Submitted by 梦想的初衷 on 2021-02-07 21:32:38
Question: I've got an output from a Spark Aggregator which is a List[Character]:

    case class Character(name: String, secondName: String, faculty: String)
    val charColumn = HPAggregator.toColumn
    val resultDF = someDF.select(charColumn)

So my dataframe looks like:

    +-----------------------------------------------+
    | value                                         |
    +-----------------------------------------------+
    |[[harry, potter, gryffindor],[ron, weasley ... |
    +-----------------------------------------------+

Now I want to convert it to +------------
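The question is cut off above, but here is a minimal sketch of the usual approach, assuming the value column holds an array of Character structs (only resultDF and the field names come from the question; everything else is illustrative):

```scala
import org.apache.spark.sql.functions.{col, explode}

// Explode the array column so each Character struct becomes its own row,
// then flatten the struct fields into top-level columns.
val exploded = resultDF
  .select(explode(col("value")).as("character"))
  .select("character.name", "character.secondName", "character.faculty")
```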

Read a binary column in spark using java language

Submitted by 六月ゝ 毕业季﹏ on 2021-02-07 21:00:34
Question: I have a DataFrame which contains a BinaryType column. DataFrame:

    +--------------------------------------------------------------------------------------------------------------------------------

java.lang.IllegalArgumentException when applying a Python UDF to a Spark dataframe

Submitted by 时光毁灭记忆、已成空白 on 2021-02-07 20:39:38
Question: I'm testing the example code provided in the documentation of pandas_udf (https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf), using PySpark 2.3.1 on my local machine:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
        ("id", "v"))

    @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
    def normalize(pdf):
        v = pdf.v

Could anyone please explain what c000 means in c000.snappy.parquet or c000.snappy.orc?

Submitted by 时间秒杀一切 on 2021-02-07 20:30:26
Question: I have searched through all the documentation and still haven't found why there is a prefix, or what c000 is, in the file naming convention below:

    file:/Users/stephen/p/spark/f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet

Answer 1: You should follow the "Talk is cheap, show me the code" approach. Not everything is documented, and one way to find out is to read the code. Consider part-1-2_3-4.parquet:

- Split/partition number.
- Random UUID to prevent collisions between different (appending)
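The answer is cut off above; as a hedged illustration (the path and data below are made up, not from the original post), writing any DataFrame shows the full pattern part-<split>-<uuid>-c<counter>.<codec>.<format>, where cNNN appears to be a per-task output-file counter:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("file-naming-demo")
  .getOrCreate()
import spark.implicits._

// Two partitions -> part-00000 and part-00001; snappy is the default parquet codec,
// so the files come out as e.g. part-00000-<uuid>-c000.snappy.parquet.
Seq((1, "a"), (2, "b")).toDF("id", "value")
  .repartition(2)
  .write.parquet("/tmp/file-naming-demo")
```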

How to execute async operations (i.e. returning a Future) from map/filter/etc.?

Submitted by 十年热恋 on 2021-02-07 20:20:23
Question: I have a DataSet.map operation that needs to pull data in from an external REST API. The REST API client returns a Future[Int]. Is it possible to have the DataSet.map operation somehow await this Future asynchronously? Or will I need to block the thread using Await.result? Or is this just not the done thing... i.e. should I instead try to load the data held by the API into a DataSet of its own and perform a join? Thanks in advance!

EDIT: Different from: Spark job with Async HTTP call
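No answer is included above; one common pattern (a sketch only, assuming a hypothetical client method fetchScore returning Future[Int] and a Dataset[Long] called ds) is to batch the futures inside mapPartitions, so each task blocks once per partition rather than once per element:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical REST call; the real client and its method name will differ.
def fetchScore(id: Long): Future[Int] = Future.successful(42)

// ds is assumed to be a Dataset[Long]; spark.implicits._ must be in scope
// so an Encoder[Int] is available for the result.
val enriched = ds.mapPartitions { ids =>
  // Start every request for the partition first, then block once for the whole batch.
  val futures = ids.map(fetchScore).toVector
  Await.result(Future.sequence(futures), 5.minutes).iterator
}
```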

Hive/Impala performance with string partition key vs Integer partition key

Submitted by 梦想的初衷 on 2021-02-07 19:54:36
Question: Are numeric columns recommended for partition keys? Will there be any performance difference when we run a select query against numeric-column partitions vs string-column partitions?

Answer 1: No, there is no such recommendation. Consider this: a partition in Hive is represented as a folder with a name like 'key=value' (or sometimes just 'value'), but either way it is a string folder name. So the value is stored as a string and is cast during read/write. The partition key value is not packed
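As a minimal sketch of the same point from Spark (paths and column names are made up, and this is an illustration rather than part of the original answer; an active SparkSession named spark is assumed): even when the partition column is an integer, the on-disk partition directories are plain 'key=value' strings:

```scala
import spark.implicits._

// year is an Int column, but the partition directories are string folder names.
Seq((1, 2020), (2, 2021)).toDF("id", "year")
  .write
  .partitionBy("year")
  .parquet("/tmp/partition-naming-demo")

// Resulting layout:
//   /tmp/partition-naming-demo/year=2020/part-...-c000.snappy.parquet
//   /tmp/partition-naming-demo/year=2021/part-...-c000.snappy.parquet
```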