apache-spark-1.6

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SQLContext

丶灬走出姿态 submitted on 2021-02-17 04:42:13
Question: I am using IntelliJ IDEA 2016.3. My sbt build definition is:

import sbt.Keys._
import sbt._

object ApplicationBuild extends Build {
  object Versions {
    val spark = "1.6.3"
  }
  val projectName = "example-spark"
  val common = Seq(
    version := "1.0",
    scalaVersion := "2.11.7"
  )
  val customLibraryDependencies = Seq(
    "org.apache.spark" %% "spark-core" % Versions.spark % "provided",
    "org.apache.spark" %% "spark-sql" % Versions.spark % "provided",
    "org.apache.spark" %% "spark-hive" % Versions.spark % "provided",
    "org.apache.spark"
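The excerpt ends before the question itself, but the error in the title is the classic symptom of running a Spark application whose Spark artifacts are marked "provided": spark-submit supplies them on a cluster, while `sbt run` or a default IntelliJ run configuration does not, so org.apache.spark.sql.SQLContext is missing at runtime. A minimal sketch of one common workaround (not necessarily the accepted answer) is to drop the "provided" qualifier while developing locally and restore it when packaging for spark-submit:

```scala
// build.sbt (sketch only): local-development variant with Spark on the runtime
// classpath, so `sbt run` / IDE runs can load org.apache.spark.sql.SQLContext.
// Re-add % "provided" before building the jar you hand to spark-submit.
val sparkVersion = "1.6.3"

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion,
  "org.apache.spark" %% "spark-hive" % sparkVersion
)
```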

How to enable or disable Hive support in spark-shell through Spark property (Spark 1.6)?

孤街浪徒 submitted on 2020-06-25 18:11:28
Question: Is there any configuration property we can set to explicitly disable / enable Hive support through spark-shell in Spark 1.6? I tried to list all the sqlContext configuration properties with

sqlContext.getAllConfs.foreach(println)

but I am not sure which property is actually required to disable/enable Hive support. Or is there any other way to do this?

Answer 1: Spark >= 2.0: enabling and disabling the Hive context is possible with the config spark.sql.catalogImplementation. Possible values for spark
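The answer excerpt is cut off, but both mechanisms it points at are straightforward to sketch. On Spark 1.6, Hive support is not a property at all: it depends on whether you build a plain SQLContext or a HiveContext. On Spark 2.x it is controlled by spark.sql.catalogImplementation, whose values are "in-memory" and "hive". A minimal Scala sketch, assuming a spark-shell-style SparkContext named sc:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

// Spark 1.6: the choice of context class decides Hive support.
val plainSqlContext = new SQLContext(sc)   // no Hive support
val hiveSqlContext  = new HiveContext(sc)  // Hive support (needs spark-hive on the classpath)

// Spark >= 2.0 (for comparison): choose the catalog when building the session.
// SparkSession.builder()
//   .config("spark.sql.catalogImplementation", "in-memory")  // or "hive"
//   .getOrCreate()
```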

Reading CSV into a Spark Dataframe with timestamp and date types

南笙酒味 submitted on 2020-05-25 09:05:10
Question: It's CDH with Spark 1.6. I am trying to import this hypothetical CSV into an Apache Spark DataFrame:

$ hadoop fs -cat test.csv
a,b,c,2016-09-09,a,2016-11-11 09:09:09.0,a
a,b,c,2016-09-10,a,2016-11-11 09:09:10.0,a

I use the databricks-csv jar.

val textData = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", ",")
  .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
  .option("inferSchema", "true")
  .option("nullValue", "null")
  .load("test.csv")

I use
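The excerpt stops before the actual problem statement, but if schema inference does not produce the date and timestamp types you want, the usual alternative is to pass an explicit schema instead of inferSchema. A sketch of that approach, with hypothetical column names and assuming the spark-shell sqlContext from the question (the dateFormat pattern still has to match the values being parsed):

```scala
import org.apache.spark.sql.types._

// Hypothetical column names; adjust to the real data.
val schema = StructType(Seq(
  StructField("c1", StringType,    nullable = true),
  StructField("c2", StringType,    nullable = true),
  StructField("c3", StringType,    nullable = true),
  StructField("d1", DateType,      nullable = true),
  StructField("c4", StringType,    nullable = true),
  StructField("t1", TimestampType, nullable = true),
  StructField("c5", StringType,    nullable = true)
))

val typedData = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", ",")
  .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
  .schema(schema)            // explicit types instead of inferSchema
  .option("nullValue", "null")
  .load("test.csv")
```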

What to do with “WARN TaskSetManager: Stage contains a task of very large size”?

自古美人都是妖i submitted on 2020-04-07 18:58:30
Question: I use Spark 1.6.1. My Spark application reads more than 10000 Parquet files stored in S3:

val df = sqlContext.read.option("mergeSchema", "true").parquet(myPaths: _*)

myPaths is an Array[String] that contains the paths of the 10000 Parquet files. Each path looks like s3n://bucketname/blahblah.parquet. Spark warns with a message like the one below.

WARN TaskSetManager: Stage 4 contains a task of very large size (108KB). The maximum recommended task size is 100KB.

Spark has managed to run and finish the job
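The warning means a serialized task description exceeded the 100 KB guideline; by itself it is usually just noise. The excerpt cuts off before any answer, so the sketch below only illustrates the most common user-controllable cause of the warning (a large object captured in a task closure) and the broadcast-variable fix, which may or may not be what is inflating the tasks in the 10000-file Parquet read above. It assumes a spark-shell-style sc:

```scala
val rdd = sc.parallelize(Seq("1", "2", "3"))

// A large local object.
val bigLookup: Map[String, Int] = (1 to 1000000).map(i => i.toString -> i).toMap

// Anti-pattern: bigLookup is captured in the closure and serialized into every
// task, which is exactly the kind of thing that inflates task size.
// val slow = rdd.map(x => bigLookup.getOrElse(x, 0))

// Preferred: ship it once per executor as a broadcast variable, keeping tasks small.
val bigLookupBc = sc.broadcast(bigLookup)
val result = rdd.map(x => bigLookupBc.value.getOrElse(x, 0))
```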

PySpark - How to use a row value from one column to access another column which has the same name as the row value

荒凉一梦 submitted on 2020-01-13 06:18:11
Question: I have a PySpark df:

+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|
+---+---+---+---+---+---+---+---+
|  0|  1| 23|  4|  8|  9|  5| b1|
|  1|  2| 43|  8| 10| 20| 43| e1|
|  2|  3| 15|  0|  1| 23|  7| b1|
|  3|  4|  2|  6| 11|  5|  8| d1|
|  4|  5|  6|  7|  2|  8|  1| f1|
+---+---+---+---+---+---+---+---+

I eventually want to create another column "out" whose values are based on the "ref" column. For example, in the first row the ref column has b1 as its value. In the "out" column I would like to see column "b1"
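A well-known way to express this kind of look-up-a-column-by-name is to build a when expression per candidate column and coalesce them, so each row keeps the value of the column named in ref. The sketch below uses the Scala DataFrame API with the column names from the example above; the same when/coalesce functions exist in pyspark.sql.functions, so the pattern carries over:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, when}

// df is assumed to be the DataFrame shown above, with columns a1..f1 and ref.
def withOut(df: DataFrame): DataFrame = {
  val valueCols = Seq("a1", "b1", "c1", "d1", "e1", "f1")
  // For each candidate column, keep its value only on rows where "ref" names it,
  // then take the first non-null result per row.
  val out = coalesce(valueCols.map(c => when(col("ref") === c, col(c))): _*)
  df.withColumn("out", out)
}
```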

Dynamic Allocation for Spark Streaming

无人久伴 submitted on 2019-12-22 05:16:14
Question: I have a Spark Streaming job running on our cluster alongside other jobs (Spark core jobs). I want to use Dynamic Resource Allocation for these jobs, including Spark Streaming. According to the JIRA issue below, Dynamic Allocation is not supported for Spark Streaming (in version 1.6.1) but is fixed in 2.0.0: JIRA link. According to the PDF attached to that issue, there should be a configuration field called spark.streaming.dynamicAllocation.enabled=true, but I don't see this configuration in the documentation.
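Since the property the question quotes comes from the JIRA attachment rather than the documented configuration list, any use of it should be treated as version-dependent and experimental. A minimal sketch of where such a setting would go, namely on the SparkConf used to build the StreamingContext:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-dynamic-allocation-sketch")
  // Property named in the JIRA discussion; not part of the documented public
  // configuration, so whether it has any effect depends on the Spark version.
  .set("spark.streaming.dynamicAllocation.enabled", "true")

val ssc = new StreamingContext(conf, Seconds(10))
```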

How to change hdfs block size in pyspark?

烈酒焚心 submitted on 2019-12-12 09:24:22
Question: I use PySpark to write a Parquet file. I would like to change the HDFS block size of that file. I set the block size like this and it doesn't work:

sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

Does this have to be set before starting the PySpark job? If so, how do I do it?

Answer 1: Try setting it through sc._jsc.hadoopConfiguration() with SparkContext:

from pyspark import SparkConf, SparkContext
conf = (SparkConf().setMaster("yarn"))
sc = SparkContext(conf = conf)
sc._jsc
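The question is PySpark, where sc._jsc.hadoopConfiguration() is the Py4J handle to the SparkContext's Hadoop Configuration; in Scala the same object is sc.hadoopConfiguration. A sketch of setting the block size there before writing, using a plain byte value (134217728 = 128 MB) and both the current and the deprecated Hadoop property names, assuming a spark-shell-style sc and a hypothetical output path:

```scala
import org.apache.spark.sql.SQLContext

// Set the desired HDFS block size on the Hadoop configuration *before* writing.
sc.hadoopConfiguration.set("dfs.blocksize", "134217728")   // current Hadoop 2.x key
sc.hadoopConfiguration.set("dfs.block.size", "134217728")  // older, deprecated key

// Any subsequent write picks the setting up from this configuration.
val sqlContext = new SQLContext(sc)
sqlContext.range(0, 1000000).write.parquet("hdfs:///tmp/blocksize-sketch.parquet")
```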