apache-spark

Optimal way of creating a cache in the PySpark environment

拥有回忆 submitted on 2021-02-08 05:32:11
Question: I am using Spark Streaming to build a system that enriches incoming data from a Cloudant database. Example - Incoming Message: {"id" : 123} Outgoing Message: {"id" : 123, "data": "xxxxxxxxxxxxxxxxxxx"} My code for the driver class is as follows: from Sample.Job import EnrichmentJob from Sample.Job import FunctionJob import pyspark from pyspark.streaming.kafka import KafkaUtils from pyspark import SparkContext, SparkConf, SQLContext from pyspark.streaming import StreamingContext from pyspark
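A pattern that often comes up for this kind of enrichment is to load the lookup data once on the driver and broadcast it to the executors, instead of querying Cloudant per message. The sketch below is only a hedged illustration of that idea, not the asker's code: load_lookup and the message shape are hypothetical, and it assumes the enrichment table is small enough to fit in memory.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="enrichment")
ssc = StreamingContext(sc, 5)

def load_lookup():
    # Hypothetical: however the Cloudant table is read, reduce it to {id: data}.
    return {123: "xxxxxxxxxxxxxxxxxxx"}

lookup_bc = sc.broadcast(load_lookup())

def enrich(msg):
    # msg is assumed to be a dict like {"id": 123}
    msg["data"] = lookup_bc.value.get(msg["id"])
    return msg

# stream = KafkaUtils.createDirectStream(ssc, ...)   # as in the question
# enriched = stream.map(lambda kv: enrich(kv[1]))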

In DataFrame.withColumn, how can I check if the column's value is null as a condition for the second parameter?

六眼飞鱼酱① submitted on 2021-02-08 04:59:26
Question: If I have a DataFrame called df that looks like: +----+----+ | a1| a2| +----+----+ | foo| bar| | N/A| baz| |null| etc| +----+----+ I can selectively replace values like so: val df2 = df.withColumn("a1", when($"a1" === "N/A", $"a2")) so that df2 looks like: +----+----+ | a1| a2| +----+----+ | foo| bar| | baz| baz| |null| etc| +----+----+ but why can't I check if it's null, like: val df3 = df2.withColumn("a1", when($"a1" === null, $"a2")) so that I get: +----+----+ | a1| a2| +----+----+ | foo|
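The reason === null never matches is that any comparison with SQL NULL evaluates to NULL rather than true; the Column API provides an explicit test instead, so in Scala the condition would be when($"a1".isNull, $"a2"). A minimal PySpark sketch of the same idea, using a tiny made-up DataFrame:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("null-check").getOrCreate()

df = spark.createDataFrame(
    [("foo", "bar"), ("N/A", "baz"), (None, "etc")], ["a1", "a2"])

# isNull() is the explicit null test; comparing a column to null with ==/=== does
# not work because the comparison itself evaluates to NULL.
df3 = df.withColumn("a1", F.when(F.col("a1").isNull(), F.col("a2"))
                           .otherwise(F.col("a1")))
df3.show()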

Why does pyspark fail with “Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'”?

时光总嘲笑我的痴心妄想 submitted on 2021-02-08 04:31:09
Question: For the life of me I cannot figure out what is wrong with my PySpark install. I have installed all dependencies, including Hadoop, but PySpark can't find it; am I diagnosing this correctly? See the full error message below, but it ultimately fails on PySpark SQL: pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':" nickeleres@Nicks-MBP:~$ pyspark Python 2.7.10 (default, Feb 7 2017, 00:08:15) [GCC 4.2.1 Compatible Apple
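This error typically means Spark could not set up its Hive-backed session state; commonly reported causes are permissions on /tmp/hive or a stale local metastore_db/derby lock rather than a missing Hadoop install. A hedged workaround, assuming Hive support is not actually needed, is to build the session with the in-memory catalog; whether that resolves this particular installation is an assumption.

from pyspark.sql import SparkSession

# Skip the Hive catalog entirely (only valid if Hive tables are not required).
spark = (SparkSession.builder
         .appName("no-hive")
         .config("spark.sql.catalogImplementation", "in-memory")
         .getOrCreate())

spark.range(5).show()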

Iterate over different columns using withColumn in Java Spark

最后都变了- submitted on 2021-02-08 04:10:47
Question: I have to modify a Dataset<Row> according to some rules that are in a List<Row>. I want to iterate over the Dataset<Row> columns using Dataset.withColumn(...) as seen in the next example: (import necessary libraries...) SparkSession spark = SparkSession .builder() .appName("appname") .config("spark.some.config.option", "some-value") .getOrCreate(); Dataset<Row> dfToModify = spark.read().table("TableToModify"); List<Row> ListWithInfo = new ArrayList<>(Arrays.asList()); ListWithInfo.add(0
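The usual pattern in any of the Spark bindings is to reassign the Dataset/DataFrame inside a plain loop so each withColumn builds on the previous result. A hedged PySpark sketch of that loop; the rules list and column names here are hypothetical and stand in for the asker's List<Row>:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("iterate-withcolumn").getOrCreate()

df = spark.createDataFrame([("N/A", "x"), ("ok", "N/A")], ["col_a", "col_b"])

# Hypothetical rules: (column to modify, value to match, replacement)
rules = [("col_a", "N/A", "UNKNOWN"), ("col_b", "N/A", "UNKNOWN")]

for col_name, match_val, replacement in rules:
    # Reassign df each iteration so the transformations accumulate.
    df = df.withColumn(
        col_name,
        F.when(F.col(col_name) == match_val, F.lit(replacement))
         .otherwise(F.col(col_name)))

df.show()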

Can't connect to Azure Data Lake Gen2 using PySpark and Databricks Connect

折月煮酒 submitted on 2021-02-08 03:59:29
Question: Recently, Databricks launched Databricks Connect, which allows you to write jobs using Spark native APIs and have them execute remotely on an Azure Databricks cluster instead of in the local Spark session. It works fine except when I try to access files in Azure Data Lake Storage Gen2. When I execute this: spark.read.json("abfss://...").count() I get this error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs
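The ClassNotFoundException points at the shaded ABFS filesystem classes not being on the local Databricks Connect classpath, which may be a client-side limitation rather than a credentials problem. For reference, the sketch below shows the standard way ABFS OAuth settings are supplied to a Spark session; the account, client and tenant values are placeholders, and whether this helps under Databricks Connect is an assumption.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders: <account>, <client-id>, <client-secret>, <tenant-id> are not real values.
account = "<account>.dfs.core.windows.net"

spark.conf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}", "<client-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}", "<client-secret>")
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# spark.read.json("abfss://<container>@<account>/path").count()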

Spark2 session for Cassandra, SQL queries

南楼画角 submitted on 2021-02-08 03:50:12
Question: In Spark 2.0, what is the best way to create a Spark session? In both Spark 2.0 and Cassandra the APIs have been reworked, essentially deprecating SqlContext (and also CassandraSqlContext). So for executing SQL, I can either create a Cassandra Session (com.datastax.driver.core.Session) and use execute(" "), or create a SparkSession (org.apache.spark.sql.SparkSession) and use its sql(String sqlText) method. I don't know the SQL limitations of either; can someone explain?
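With Spark 2.x the usual route is the DataStax spark-cassandra-connector: build one SparkSession, register the Cassandra table as a DataFrame/temp view, and run spark.sql over it, keeping the raw com.datastax.driver Session for CQL statements (DDL, lightweight transactions) that Spark SQL does not cover. A hedged PySpark sketch; the host, keyspace and table names are placeholders, and it assumes the connector package is on the classpath.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-sql")
         .config("spark.cassandra.connection.host", "127.0.0.1")   # placeholder host
         .getOrCreate())

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")           # placeholder names
      .load())

df.createOrReplaceTempView("my_table")
spark.sql("SELECT count(*) FROM my_table").show()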

Why isn't a very big Spark stage using all available executors?

ε祈祈猫儿з submitted on 2021-02-08 03:39:56
Question: I am running a Spark job with some very big stages (e.g. >20k tasks), and am running it with 1k to 2k executors. In some cases, a stage will appear to run unstably: many available executors become idle over time, despite still being in the middle of a stage with many unfinished tasks. From the user's perspective, it appears that tasks are finishing, but executors that have finished a given task do not get a new task assigned to them. As a result, the stage takes longer than it should, and a lot
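One frequently cited cause of this symptom is delay scheduling: the scheduler holds tasks back waiting for data-local executors, governed by spark.locality.wait, so free executors sit idle. A hedged sketch of how that knob would be set when building the session; whether it explains this particular job is an assumption.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("big-stage")
         # Stop waiting for data-local slots; hand tasks to any free executor.
         .config("spark.locality.wait", "0s")
         .getOrCreate())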

Reading partition columns without partition column names

半城伤御伤魂 submitted on 2021-02-08 03:36:05
Question: We have data stored in S3, partitioned with the following structure: bucket/directory/table/aaaa/bb/cc/dd/ where aaaa is the year, bb is the month, cc is the day and dd is the hour. As you can see, there are no partition keys in the path (year=aaaa, month=bb, day=cc, hour=dd). As a result, when I read the table into Spark, there are no year, month, day or hour columns. Is there any way I can read the table into Spark and include the partition columns without: changing the path names in s3
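One way to recover the partition values without renaming anything in S3 is to derive them from the file path itself with input_file_name plus regexp_extract. A hedged sketch; the parquet format and the exact bucket/directory names are assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("path-partitions").getOrCreate()

df = spark.read.parquet("s3://bucket/directory/table/*/*/*/*/")   # assumed path/format

pattern = r"table/(\d{4})/(\d{2})/(\d{2})/(\d{2})/"
df = (df
      .withColumn("path",  F.input_file_name())
      .withColumn("year",  F.regexp_extract("path", pattern, 1))
      .withColumn("month", F.regexp_extract("path", pattern, 2))
      .withColumn("day",   F.regexp_extract("path", pattern, 3))
      .withColumn("hour",  F.regexp_extract("path", pattern, 4))
      .drop("path"))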

Spark error with google/guava library: java.lang.NoSuchMethodError: com.google.common.cache.CacheBuilder.refreshAfterWrite

人盡茶涼 submitted on 2021-02-08 03:11:10
Question: I have a simple Spark project in which the pom.xml dependencies are only the basic Scala, scalatest/junit, and Spark: <dependency> <groupId>net.alchim31.maven</groupId> <artifactId>scala-maven-plugin</artifactId> <version>3.2.0</version> </dependency> <dependency> <groupId>org.scala-lang</groupId> <artifactId>scala-library</artifactId> <version>${scala.version}</version> </dependency> <dependency> <groupId>org.scala-lang</groupId> <artifactId>scala-compiler</artifactId> <version