apache-spark

Optimal way of creating a cache in the PySpark environment

拥有回忆 submitted on 2021-02-08 05:32:11
Question: I am using Spark Streaming to build a system that enriches incoming data from a Cloudant database. Example - Incoming Message: {"id" : 123} Outgoing Message: {"id" : 123, "data": "xxxxxxxxxxxxxxxxxxx"} My code for the driver class is as follows: from Sample.Job import EnrichmentJob from Sample.Job import FunctionJob import pyspark from pyspark.streaming.kafka import KafkaUtils from pyspark import SparkContext, SparkConf, SQLContext from pyspark.streaming import StreamingContext from pyspark
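A pattern that often comes up for this kind of enrichment is to load the lookup data once on the driver and broadcast it to the executors, instead of querying Cloudant per message. The sketch below is only a hedged illustration of that idea, not the asker's code: load_lookup and the message shape are hypothetical, and it assumes the enrichment table is small enough to fit in memory.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="enrichment")
ssc = StreamingContext(sc, 5)

def load_lookup():
    # Hypothetical: however the Cloudant table is read, reduce it to {id: data}.
    return {123: "xxxxxxxxxxxxxxxxxxx"}

lookup_bc = sc.broadcast(load_lookup())

def enrich(msg):
    # msg is assumed to be a dict like {"id": 123}
    msg["data"] = lookup_bc.value.get(msg["id"])
    return msg

# stream = KafkaUtils.createDirectStream(ssc, ...)   # as in the question
# enriched = stream.map(lambda kv: enrich(kv[1]))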

In DataFrame.withColumn, how can I check if the column's value is null as a condition for the second parameter?

六眼飞鱼酱① submitted on 2021-02-08 04:59:26
Question: If I have a DataFrame called df that looks like: +----+----+ | a1| a2| +----+----+ | foo| bar| | N/A| baz| |null| etc| +----+----+ I can selectively replace values like so: val df2 = df.withColumn("a1", when($"a1" === "N/A", $"a2")) so that df2 looks like: +----+----+ | a1| a2| +----+----+ | foo| bar| | baz| baz| |null| etc| +----+----+ but why can't I check if it's null, like: val df3 = df2.withColumn("a1", when($"a1" === null, $"a2")) so that I get: +----+----+ | a1| a2| +----+----+ | foo|
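The reason === null never matches is that any comparison with SQL NULL evaluates to NULL rather than true; the Column API provides an explicit test instead, so in Scala the condition would be when($"a1".isNull, $"a2"). A minimal PySpark sketch of the same idea, using a tiny made-up DataFrame:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("null-check").getOrCreate()

df = spark.createDataFrame(
    [("foo", "bar"), ("N/A", "baz"), (None, "etc")], ["a1", "a2"])

# isNull() is the explicit null test; comparing a column to null with ==/=== does
# not work because the comparison itself evaluates to NULL.
df3 = df.withColumn("a1", F.when(F.col("a1").isNull(), F.col("a2"))
                           .otherwise(F.col("a1")))
df3.show()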

Why does pyspark fail with “Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'”?

时光总嘲笑我的痴心妄想 submitted on 2021-02-08 04:31:09
Question: For the life of me I cannot figure out what is wrong with my PySpark install. I have installed all dependencies, including Hadoop, but PySpark can't find it; am I diagnosing this correctly? See the full error message below, but it ultimately fails on PySpark SQL: pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':" nickeleres@Nicks-MBP:~$ pyspark Python 2.7.10 (default, Feb 7 2017, 00:08:15) [GCC 4.2.1 Compatible Apple
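This error typically means Spark could not set up its Hive-backed session state; commonly reported causes are permissions on /tmp/hive or a stale local metastore_db/derby lock rather than a missing Hadoop install. A hedged workaround, assuming Hive support is not actually needed, is to build the session with the in-memory catalog; whether that resolves this particular installation is an assumption.

from pyspark.sql import SparkSession

# Skip the Hive catalog entirely (only valid if Hive tables are not required).
spark = (SparkSession.builder
         .appName("no-hive")
         .config("spark.sql.catalogImplementation", "in-memory")
         .getOrCreate())

spark.range(5).show()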

Iterate over different columns using withColumn in Java Spark

最后都变了- submitted on 2021-02-08 04:10:47
Question: I have to modify a Dataset<Row> according to some rules that are in a List<Row>. I want to iterate over the Dataset<Row> columns using Dataset.withColumn(...) as seen in the next example: (import necessary libraries...) SparkSession spark = SparkSession .builder() .appName("appname") .config("spark.some.config.option", "some-value") .getOrCreate(); Dataset<Row> dfToModify = spark.read().table("TableToModify"); List<Row> ListWithInfo = new ArrayList<>(Arrays.asList()); ListWithInfo.add(0
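The usual pattern in any of the Spark bindings is to reassign the Dataset/DataFrame inside a plain loop so each withColumn builds on the previous result. A hedged PySpark sketch of that loop; the rules list and column names here are hypothetical and stand in for the asker's List<Row>:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("iterate-withcolumn").getOrCreate()

df = spark.createDataFrame([("N/A", "x"), ("ok", "N/A")], ["col_a", "col_b"])

# Hypothetical rules: (column to modify, value to match, replacement)
rules = [("col_a", "N/A", "UNKNOWN"), ("col_b", "N/A", "UNKNOWN")]

for col_name, match_val, replacement in rules:
    # Reassign df each iteration so the transformations accumulate.
    df = df.withColumn(
        col_name,
        F.when(F.col(col_name) == match_val, F.lit(replacement))
         .otherwise(F.col(col_name)))

df.show()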

Can't connect to Azure Data Lake Gen2 using PySpark and Databricks Connect

折月煮酒 submitted on 2021-02-08 03:59:29
Question: Recently, Databricks launched Databricks Connect, which allows you to write jobs using Spark native APIs and have them execute remotely on an Azure Databricks cluster instead of in the local Spark session. It works fine except when I try to access files in Azure Data Lake Storage Gen2. When I execute this: spark.read.json("abfss://...").count() I get this error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs
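The ClassNotFoundException points at the shaded ABFS filesystem classes not being on the local Databricks Connect classpath, which may be a client-side limitation rather than a credentials problem. For reference, the sketch below shows the standard way ABFS OAuth settings are supplied to a Spark session; the account, client and tenant values are placeholders, and whether this helps under Databricks Connect is an assumption.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders: <account>, <client-id>, <client-secret>, <tenant-id> are not real values.
account = "<account>.dfs.core.windows.net"

spark.conf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}", "<client-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}", "<client-secret>")
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# spark.read.json("abfss://<container>@<account>/path").count()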

Spark2 session for Cassandra, SQL queries

南楼画角 submitted on 2021-02-08 03:50:12
Question: In Spark 2.0, what is the best way to create a Spark session? In both Spark 2.0 and Cassandra the APIs have been reworked, essentially deprecating SqlContext (and also CassandraSqlContext). So for executing SQL, I can either create a Cassandra Session (com.datastax.driver.core.Session) and use execute(" "), or create a SparkSession (org.apache.spark.sql.SparkSession) and use its sql(String sqlText) method. I don't know the SQL limitations of either; can someone explain?
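With Spark 2.x the usual route is the DataStax spark-cassandra-connector: build one SparkSession, register the Cassandra table as a DataFrame/temp view, and run spark.sql over it, keeping the raw com.datastax.driver Session for CQL statements (DDL, lightweight transactions) that Spark SQL does not cover. A hedged PySpark sketch; the host, keyspace and table names are placeholders, and it assumes the connector package is on the classpath.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-sql")
         .config("spark.cassandra.connection.host", "127.0.0.1")   # placeholder host
         .getOrCreate())

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")           # placeholder names
      .load())

df.createOrReplaceTempView("my_table")
spark.sql("SELECT count(*) FROM my_table").show()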

Why isn't a very big Spark stage using all available executors?

ε祈祈猫儿з submitted on 2021-02-08 03:39:56
Question: I am running a Spark job with some very big stages (e.g. >20k tasks), and am running it with 1k to 2k executors. In some cases, a stage will appear to run unstably: many available executors become idle over time, despite still being in the middle of a stage with many unfinished tasks. From the user's perspective, it appears that tasks are finishing, but executors that have finished a given task do not get a new task assigned to them. As a result, the stage takes longer than it should, and a lot
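One frequently cited cause of this symptom is delay scheduling: the scheduler holds tasks back waiting for data-local executors, governed by spark.locality.wait, so free executors sit idle. A hedged sketch of how that knob would be set when building the session; whether it explains this particular job is an assumption.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("big-stage")
         # Stop waiting for data-local slots; hand tasks to any free executor.
         .config("spark.locality.wait", "0s")
         .getOrCreate())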

Reading partition columns without partition column names

半城伤御伤魂 submitted on 2021-02-08 03:36:05
Question: We have data stored in S3, partitioned with the following structure: bucket/directory/table/aaaa/bb/cc/dd/ where aaaa is the year, bb is the month, cc is the day and dd is the hour. As you can see, there are no partition keys in the path (year=aaaa, month=bb, day=cc, hour=dd). As a result, when I read the table into Spark, there are no year, month, day or hour columns. Is there any way I can read the table into Spark and include the partition columns without: changing the path names in s3
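One way to recover the partition values without renaming anything in S3 is to derive them from the file path itself with input_file_name plus regexp_extract. A hedged sketch; the parquet format and the exact bucket/directory names are assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("path-partitions").getOrCreate()

df = spark.read.parquet("s3://bucket/directory/table/*/*/*/*/")   # assumed path/format

pattern = r"table/(\d{4})/(\d{2})/(\d{2})/(\d{2})/"
df = (df
      .withColumn("path",  F.input_file_name())
      .withColumn("year",  F.regexp_extract("path", pattern, 1))
      .withColumn("month", F.regexp_extract("path", pattern, 2))
      .withColumn("day",   F.regexp_extract("path", pattern, 3))
      .withColumn("hour",  F.regexp_extract("path", pattern, 4))
      .drop("path"))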

Spark error with google/guava library: java.lang.NoSuchMethodError: com.google.common.cache.CacheBuilder.refreshAfterWrite

人盡茶涼 submitted on 2021-02-08 03:11:10
Question: I have a simple Spark project in which the pom.xml dependencies are only the basic Scala, scalatest/junit, and Spark: <dependency> <groupId>net.alchim31.maven</groupId> <artifactId>scala-maven-plugin</artifactId> <version>3.2.0</version> </dependency> <dependency> <groupId>org.scala-lang</groupId> <artifactId>scala-library</artifactId> <version>${scala.version}</version> </dependency> <dependency> <groupId>org.scala-lang</groupId> <artifactId>scala-compiler</artifactId> <version