spark-dataframe

Not able to create a view on a Hive table using HiveContext; getting a DbLockManager issue

老子叫甜甜 submitted on 2019-12-11 04:43:47
Question: I am not able to create a view on a Hive table using HiveContext; I am facing a DbLockManager lock issue. The same view-creation query works fine in Hive Beeline, but it fails when executed through HiveContext.
    17/02/23 10:44:18 INFO metastore: Trying to connect to metastore with URI thrift://XXXXXXXXXXXXXXXXXXXXXXXXXXX
    17/02/23 10:44:18 INFO metastore: Connected to metastore.
    17/02/23 10:44:18 INFO DbLockManager: Response to queryId=XXXXXXXX_20170223104411_2b1a475e-ad6d-45b3-8ec6-6a30a9123664
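For reference, a minimal sketch of issuing the same CREATE VIEW statement through HiveContext (Spark 1.x style; the application name and the database, table, and view names are placeholders, not values from the question):
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("CreateHiveView"))
    val hiveContext = new HiveContext(sc)

    // Issue the same DDL that works in Beeline, but through the HiveContext session
    hiveContext.sql("CREATE VIEW IF NOT EXISTS my_db.my_view AS SELECT * FROM my_db.my_table")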

Spark DataFrames are created successfully but cannot be written to the local disk

你说的曾经没有我的故事 submitted on 2019-12-11 03:38:02
Question: I am using the IntelliJ IDE to execute Spark Scala code on the Microsoft Windows platform. I have four Spark DataFrames of around 30000 records each, and as part of my requirement I tried to take one column from each of those DataFrames. I used Spark SQL functions to do it and it executed successfully. When I call DF.show() or DF.count(), I can see results on the screen, but when I try to write the dataframe to my local disk (a Windows directory) the job is getting aborted with
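A minimal sketch of writing a DataFrame to a local Windows path (Spark 2.x style; resultDF and the output path are placeholders). On Windows, a missing HADOOP_HOME/winutils.exe setup is a common reason the write stage aborts even though show() and count() succeed, so that is worth checking alongside the actual error message:
    // resultDF and the path are hypothetical; file:// forces the local filesystem
    resultDF
      .coalesce(1)                     // optional: collapse to a single output file for small results
      .write
      .mode("overwrite")
      .option("header", "true")
      .csv("file:///C:/tmp/output")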

Fastest way to check if a DataFrame (Scala) is empty?

不羁岁月 submitted on 2019-12-11 02:47:50
Question: What is the fastest way to check whether a DataFrame (Scala) is empty? I use DF.limit(1).rdd.isEmpty, which is faster than DF.rdd.isEmpty but still not ideal. Is there a better way to do that?
Answer 1: I usually wrap a call to first in a Try:
    import scala.util.Try
    val t = Try(df.first)
From there you can match on it, Success or Failure, to control logic:
    import scala.util.{Success, Failure}
    t match {
      case Success(df) => // do stuff with the dataframe
      case Failure(e) => // dataframe is empty; do other stuff
      //e
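A minimal sketch comparing the approaches mentioned above (df is any DataFrame):
    import scala.util.Try

    // Pull at most one row to the driver instead of converting to an RDD
    val isEmptyViaHead = df.head(1).isEmpty

    // The Try-based variant from the answer: first() throws on an empty DataFrame
    val isEmptyViaTry = Try(df.first).isFailure

    // Spark 2.4+ also ships a built-in df.isEmpty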

Getting an empty set while reading data from Kafka in Spark Streaming

我的未来我决定 submitted on 2019-12-11 02:25:55
Question: Hi, I am new to Spark Streaming. I am trying to read an XML file and send it to a Kafka topic. Here is my Kafka code, which sends data to the Kafka console consumer. Code:
    package org.apache.kafka.Kafka_Producer;
    import java.io.BufferedReader;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Properties;
    import java.util.concurrent.ExecutionException;
    import kafka
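As a point of comparison, a minimal Scala sketch of the producer side using the org.apache.kafka.clients.producer API; the broker address, topic name, and input file are placeholder assumptions, not values from the question:
    import java.util.Properties
    import scala.io.Source
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Send each line of the file as one record to the (hypothetical) topic
    for (line <- Source.fromFile("input.xml").getLines())
      producer.send(new ProducerRecord[String, String]("my-topic", line))
    producer.close()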

Extracting values from a Spark column containing nested values [duplicate]

怎甘沉沦 submitted on 2019-12-11 01:57:27
Question: This question already has answers here: Querying Spark SQL DataFrame with complex types (3 answers). Closed last year. This is part of the schema of my MongoDB collection:
    |-- variables: struct (nullable = true)
    |    |-- actives: struct (nullable = true)
    |    |    |-- data: struct (nullable = true)
    |    |    |    |-- 0: struct (nullable = true)
    |    |    |    |    |-- active: integer (nullable = true)
    |    |    |    |    |-- inactive: integer (nullable = true)
I've fetched the collection and stored it in a Spark dataframe and am now
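A minimal sketch of pulling the nested fields out, assuming the dataframe is called df and has the schema above; getField walks the struct one level at a time:
    import org.apache.spark.sql.functions.col

    val active   = col("variables").getField("actives").getField("data").getField("0").getField("active")
    val inactive = col("variables").getField("actives").getField("data").getField("0").getField("inactive")

    df.select(active.as("active"), inactive.as("inactive")).show()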

Spark: cast a column to a SQL type stored in a string

情到浓时终转凉″ submitted on 2019-12-11 01:32:04
Question: The simple request is that I need help adding a column to a dataframe, but the column has to be empty, its type is from ...spark.sql.types, and the type has to be defined from a string. I could probably do this with ifs or a case, but I'm looking for something more elegant, something that does not require writing a case for every type in org.apache.spark.sql.types. If I do this, for example:
    df = df.withColumn("col_name", lit(null).cast(org.apache.spark.sql.types.StringType))
It works as intended, but I
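One way to avoid a per-type match, as a minimal sketch: Column.cast also accepts the type's string name, so the string can be passed straight through (typeName is a hypothetical value read from configuration):
    import org.apache.spark.sql.functions.lit

    val typeName = "string"                                   // hypothetical, e.g. "int", "decimal(10,2)"
    val df2 = df.withColumn("col_name", lit(null).cast(typeName))

    // If an actual DataType object is needed, newer Spark versions can parse it from DDL:
    // val dt = org.apache.spark.sql.types.DataType.fromDDL(typeName)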

Create a column in a PySpark dataframe using a list whose indices are present in one column of the dataframe

半世苍凉 submitted on 2019-12-11 01:22:31
Question: I'm new to Python and PySpark. I have a dataframe in PySpark like the following:
    ## +---+---+------+
    ## | x1| x2| x3   |
    ## +---+---+------+
    ## |  0| a | 13.0 |
    ## |  2| B | -33.0|
    ## |  1| B | -63.0|
    ## +---+---+------+
I have an array: arr = [10, 12, 13]. I want to create a column x4 in the dataframe such that it has the corresponding values from the list, using the values of x1 as indices. The final dataset should look like:
    ## +---+---+------+-----+
    ## | x1| x2| x3   | x4  |
    ## +---+---+------+-
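The question is PySpark, but the idea carries over directly. A minimal Scala sketch (df, and the assumption that x1 is always a valid 0-based index into the list, are hypothetical): build a literal array column from the list and index it with x1:
    import org.apache.spark.sql.functions.{array, expr, lit}

    val arr = Seq(10, 12, 13)
    val withX4 = df
      .withColumn("lookup", array(arr.map(lit): _*))   // literal array column [10, 12, 13]
      .withColumn("x4", expr("lookup[x1]"))            // 0-based array indexing by the x1 column
      .drop("lookup")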

Programmatically generate the schema AND the data for a dataframe in Apache Spark

穿精又带淫゛_ submitted on 2019-12-11 00:19:38
Question: I would like to dynamically generate a dataframe containing a header record for a report, i.e. create a dataframe from the value of the string below:
    val headerDescs : String = "Name,Age,Location"
    val headerSchema = StructType(headerDescs.split(",").map(fieldName => StructField(fieldName, StringType, true)))
However, now I want to do the same for the data (which is in effect the same data, i.e. the metadata). I create an RDD:
    val headerRDD = sc.parallelize(headerDescs.split(","))
I then
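Continuing the excerpt, a minimal sketch of pairing the generated schema with a single data row built from the same string (sc and sqlContext are assumed from the question's context):
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val headerDescs: String = "Name,Age,Location"
    val headerSchema = StructType(headerDescs.split(",").map(f => StructField(f, StringType, true)))

    // One Row whose values are the field names themselves, wrapped in an RDD[Row]
    val headerRDD = sc.parallelize(Seq(Row(headerDescs.split(","): _*)))
    val headerDF  = sqlContext.createDataFrame(headerRDD, headerSchema)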

How to add a JAR using HiveContext in a Spark job

孤街醉人 submitted on 2019-12-10 23:46:41
Question: I am trying to add the JSONSerDe jar file in order to access the JSON data and load it into a Hive table from the Spark job. My code is shown below:
    SparkConf sparkConf = new SparkConf().setAppName("KafkaStreamToHbase");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    JavaStreamingContext jssc = new JavaStreamingContext(sc, Durations.seconds(10));
    final SQLContext sqlContext = new SQLContext(sc);
    final HiveContext hiveContext = new HiveContext(sc);
    hiveContext.sql("ADD JAR hdfs:/
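A minimal Scala sketch of the same idea (the jar path below is a placeholder, not the one elided in the question): register the SerDe jar through the HiveContext session, or ship it with the job:
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

    // Register the SerDe jar with the Hive session used by the job (placeholder path)
    hiveContext.sql("ADD JAR hdfs:///user/libs/json-serde.jar")

    // Alternatives: ship the jar with the application, e.g.
    //   spark-submit --jars /local/path/json-serde.jar ...
    // or sc.addJar("/local/path/json-serde.jar")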

Spark custom Kryo encoder not providing a schema for a UDF

为君一笑 submitted on 2019-12-10 23:39:35
Question: While following along with "How to store custom objects in Dataset?" and trying to register my own Kryo encoder for a data frame, I run into the error Schema for type com.esri.core.geometry.Envelope is not supported. There is a function which parses a String (WKT) into a geometry object, like:
    def mapWKTToEnvelope(wkt: String): Envelope = {
      val envBound = new Envelope()
      val spatialReference = SpatialReference.create(4326)
      // Parse the WKT String into a Geometry Object
      val ogcObj = OGCGeometry
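One common workaround, as a minimal sketch (it assumes the mapWKTToEnvelope helper above and that Envelope exposes getXMin/getYMin/getXMax/getYMax accessors): a Kryo encoder helps typed Dataset operations such as map, but a UDF still needs a return type Catalyst can derive a schema for, so return the envelope's bounds as a case class instead of the Envelope itself:
    import org.apache.spark.sql.functions.udf

    case class EnvelopeBounds(xmin: Double, ymin: Double, xmax: Double, ymax: Double)

    val wktToBounds = udf { wkt: String =>
      val env = mapWKTToEnvelope(wkt)   // helper from the question
      EnvelopeBounds(env.getXMin, env.getYMin, env.getXMax, env.getYMax)
    }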