apache-spark-1.6

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SQLContext

丶灬走出姿态 submitted on 2021-02-17 04:42:13
Question: I am using IntelliJ IDEA 2016.3. My sbt build definition is:

import sbt.Keys._
import sbt._

object ApplicationBuild extends Build {
  object Versions {
    val spark = "1.6.3"
  }
  val projectName = "example-spark"
  val common = Seq(
    version := "1.0",
    scalaVersion := "2.11.7"
  )
  val customLibraryDependencies = Seq(
    "org.apache.spark" %% "spark-core" % Versions.spark % "provided",
    "org.apache.spark" %% "spark-sql" % Versions.spark % "provided",
    "org.apache.spark" %% "spark-hive" % Versions.spark % "provided",
    "org.apache.spark"
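The excerpt ends before the question itself, but the error in the title is the classic symptom of running a Spark application whose Spark artifacts are marked "provided": spark-submit supplies them on a cluster, while `sbt run` or a default IntelliJ run configuration does not, so org.apache.spark.sql.SQLContext is missing at runtime. A minimal sketch of one common workaround (not necessarily the accepted answer) is to drop the "provided" qualifier while developing locally and restore it when packaging for spark-submit:

```scala
// build.sbt (sketch only): local-development variant with Spark on the runtime
// classpath, so `sbt run` / IDE runs can load org.apache.spark.sql.SQLContext.
// Re-add % "provided" before building the jar you hand to spark-submit.
val sparkVersion = "1.6.3"

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion,
  "org.apache.spark" %% "spark-hive" % sparkVersion
)
```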

How to enable or disable Hive support in spark-shell through Spark property (Spark 1.6)?

孤街浪徒 submitted on 2020-06-25 18:11:28
Question: Is there any configuration property we can set to explicitly disable / enable Hive support through spark-shell in Spark 1.6? I tried to list all the sqlContext configuration properties with

sqlContext.getAllConfs.foreach(println)

but I am not sure which property is actually required to disable/enable Hive support. Or is there any other way to do this?

Answer 1: Spark >= 2.0: enabling and disabling the Hive context is possible with the config spark.sql.catalogImplementation. Possible values for spark
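The answer excerpt is cut off, but both mechanisms it points at are straightforward to sketch. On Spark 1.6, Hive support is not a property at all: it depends on whether you build a plain SQLContext or a HiveContext. On Spark 2.x it is controlled by spark.sql.catalogImplementation, whose values are "in-memory" and "hive". A minimal Scala sketch, assuming a spark-shell-style SparkContext named sc:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

// Spark 1.6: the choice of context class decides Hive support.
val plainSqlContext = new SQLContext(sc)   // no Hive support
val hiveSqlContext  = new HiveContext(sc)  // Hive support (needs spark-hive on the classpath)

// Spark >= 2.0 (for comparison): choose the catalog when building the session.
// SparkSession.builder()
//   .config("spark.sql.catalogImplementation", "in-memory")  // or "hive"
//   .getOrCreate()
```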

Reading CSV into a Spark Dataframe with timestamp and date types

南笙酒味 submitted on 2020-05-25 09:05:10
Question: It's CDH with Spark 1.6. I am trying to import this hypothetical CSV into an Apache Spark DataFrame:

$ hadoop fs -cat test.csv
a,b,c,2016-09-09,a,2016-11-11 09:09:09.0,a
a,b,c,2016-09-10,a,2016-11-11 09:09:10.0,a

I use the databricks-csv jar.

val textData = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", ",")
  .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
  .option("inferSchema", "true")
  .option("nullValue", "null")
  .load("test.csv")

I use
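The excerpt stops before the actual problem statement, but if schema inference does not produce the date and timestamp types you want, the usual alternative is to pass an explicit schema instead of inferSchema. A sketch of that approach, with hypothetical column names and assuming the spark-shell sqlContext from the question (the dateFormat pattern still has to match the values being parsed):

```scala
import org.apache.spark.sql.types._

// Hypothetical column names; adjust to the real data.
val schema = StructType(Seq(
  StructField("c1", StringType,    nullable = true),
  StructField("c2", StringType,    nullable = true),
  StructField("c3", StringType,    nullable = true),
  StructField("d1", DateType,      nullable = true),
  StructField("c4", StringType,    nullable = true),
  StructField("t1", TimestampType, nullable = true),
  StructField("c5", StringType,    nullable = true)
))

val typedData = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", ",")
  .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
  .schema(schema)            // explicit types instead of inferSchema
  .option("nullValue", "null")
  .load("test.csv")
```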

What to do with “WARN TaskSetManager: Stage contains a task of very large size”?

自古美人都是妖i submitted on 2020-04-07 18:58:30
Question: I use Spark 1.6.1. My Spark application reads more than 10000 Parquet files stored in S3:

val df = sqlContext.read.option("mergeSchema", "true").parquet(myPaths: _*)

myPaths is an Array[String] that contains the paths of the 10000 Parquet files. Each path looks like s3n://bucketname/blahblah.parquet. Spark warns with a message like the one below.

WARN TaskSetManager: Stage 4 contains a task of very large size (108KB). The maximum recommended task size is 100KB.

Spark has managed to run and finish the job
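The warning means a serialized task description exceeded the 100 KB guideline; by itself it is usually just noise. The excerpt cuts off before any answer, so the sketch below only illustrates the most common user-controllable cause of the warning (a large object captured in a task closure) and the broadcast-variable fix, which may or may not be what is inflating the tasks in the 10000-file Parquet read above. It assumes a spark-shell-style sc:

```scala
val rdd = sc.parallelize(Seq("1", "2", "3"))

// A large local object.
val bigLookup: Map[String, Int] = (1 to 1000000).map(i => i.toString -> i).toMap

// Anti-pattern: bigLookup is captured in the closure and serialized into every
// task, which is exactly the kind of thing that inflates task size.
// val slow = rdd.map(x => bigLookup.getOrElse(x, 0))

// Preferred: ship it once per executor as a broadcast variable, keeping tasks small.
val bigLookupBc = sc.broadcast(bigLookup)
val result = rdd.map(x => bigLookupBc.value.getOrElse(x, 0))
```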

PySpark - How to use a row value from one column to access another column which has the same name as the row value

荒凉一梦 submitted on 2020-01-13 06:18:11
Question: I have a PySpark df:

+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|
+---+---+---+---+---+---+---+---+
|  0|  1| 23|  4|  8|  9|  5| b1|
|  1|  2| 43|  8| 10| 20| 43| e1|
|  2|  3| 15|  0|  1| 23|  7| b1|
|  3|  4|  2|  6| 11|  5|  8| d1|
|  4|  5|  6|  7|  2|  8|  1| f1|
+---+---+---+---+---+---+---+---+

I eventually want to create another column "out" whose values are based on the "ref" column. For example, in the first row the ref column has b1 as its value. In the "out" column I would like to see column "b1"
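A well-known way to express this kind of look-up-a-column-by-name is to build a when expression per candidate column and coalesce them, so each row keeps the value of the column named in ref. The sketch below uses the Scala DataFrame API with the column names from the example above; the same when/coalesce functions exist in pyspark.sql.functions, so the pattern carries over:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, when}

// df is assumed to be the DataFrame shown above, with columns a1..f1 and ref.
def withOut(df: DataFrame): DataFrame = {
  val valueCols = Seq("a1", "b1", "c1", "d1", "e1", "f1")
  // For each candidate column, keep its value only on rows where "ref" names it,
  // then take the first non-null result per row.
  val out = coalesce(valueCols.map(c => when(col("ref") === c, col(c))): _*)
  df.withColumn("out", out)
}
```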

Dynamic Allocation for Spark Streaming

无人久伴 submitted on 2019-12-22 05:16:14
Question: I have a Spark Streaming job running on our cluster alongside other jobs (Spark core jobs). I want to use Dynamic Resource Allocation for these jobs, including Spark Streaming. According to the JIRA issue below, Dynamic Allocation is not supported for Spark Streaming (in version 1.6.1) but is fixed in 2.0.0: JIRA link. According to the PDF attached to that issue, there should be a configuration field called spark.streaming.dynamicAllocation.enabled=true, but I don't see this configuration in the documentation.
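Since the property the question quotes comes from the JIRA attachment rather than the documented configuration list, any use of it should be treated as version-dependent and experimental. A minimal sketch of where such a setting would go, namely on the SparkConf used to build the StreamingContext:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-dynamic-allocation-sketch")
  // Property named in the JIRA discussion; not part of the documented public
  // configuration, so whether it has any effect depends on the Spark version.
  .set("spark.streaming.dynamicAllocation.enabled", "true")

val ssc = new StreamingContext(conf, Seconds(10))
```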

How to change hdfs block size in pyspark?

烈酒焚心 submitted on 2019-12-12 09:24:22
Question: I use PySpark to write a Parquet file. I would like to change the HDFS block size of that file. I set the block size like this and it doesn't work:

sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

Does this have to be set before starting the PySpark job? If so, how do I do it?

Answer 1: Try setting it through sc._jsc.hadoopConfiguration() with SparkContext:

from pyspark import SparkConf, SparkContext
conf = (SparkConf().setMaster("yarn"))
sc = SparkContext(conf = conf)
sc._jsc
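The question is PySpark, where sc._jsc.hadoopConfiguration() is the Py4J handle to the SparkContext's Hadoop Configuration; in Scala the same object is sc.hadoopConfiguration. A sketch of setting the block size there before writing, using a plain byte value (134217728 = 128 MB) and both the current and the deprecated Hadoop property names, assuming a spark-shell-style sc and a hypothetical output path:

```scala
import org.apache.spark.sql.SQLContext

// Set the desired HDFS block size on the Hadoop configuration *before* writing.
sc.hadoopConfiguration.set("dfs.blocksize", "134217728")   // current Hadoop 2.x key
sc.hadoopConfiguration.set("dfs.block.size", "134217728")  // older, deprecated key

// Any subsequent write picks the setting up from this configuration.
val sqlContext = new SQLContext(sc)
sqlContext.range(0, 1000000).write.parquet("hdfs:///tmp/blocksize-sketch.parquet")
```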