spark-dataframe

Not able to create a view on a Hive table using HiveContext; getting a DbLockManager issue

老子叫甜甜 submitted on 2019-12-11 04:43:47
Question: I am not able to create a view on a Hive table using HiveContext; I am facing a DbLockManager lock issue. The same view-creation query works fine in Hive Beeline, but it fails when executed through HiveContext.
    17/02/23 10:44:18 INFO metastore: Trying to connect to metastore with URI thrift://XXXXXXXXXXXXXXXXXXXXXXXXXXX
    17/02/23 10:44:18 INFO metastore: Connected to metastore.
    17/02/23 10:44:18 INFO DbLockManager: Response to queryId=XXXXXXXX_20170223104411_2b1a475e-ad6d-45b3-8ec6-6a30a9123664
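For reference, a minimal sketch of issuing the same CREATE VIEW statement through HiveContext (Spark 1.x style; the application name and the database, table, and view names are placeholders, not values from the question):
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("CreateHiveView"))
    val hiveContext = new HiveContext(sc)

    // Issue the same DDL that works in Beeline, but through the HiveContext session
    hiveContext.sql("CREATE VIEW IF NOT EXISTS my_db.my_view AS SELECT * FROM my_db.my_table")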

Spark DataFrames are created successfully but cannot be written to the local disk

你说的曾经没有我的故事 submitted on 2019-12-11 03:38:02
Question: I am using the IntelliJ IDE to execute Spark Scala code on the Microsoft Windows platform. I have four Spark DataFrames of around 30000 records each, and as part of my requirement I tried to take one column from each of those DataFrames. I used Spark SQL functions to do it and it executed successfully. When I call DF.show() or DF.count(), I can see results on the screen, but when I try to write the dataframe to my local disk (a Windows directory) the job is getting aborted with
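A minimal sketch of writing a DataFrame to a local Windows path (Spark 2.x style; resultDF and the output path are placeholders). On Windows, a missing HADOOP_HOME/winutils.exe setup is a common reason the write stage aborts even though show() and count() succeed, so that is worth checking alongside the actual error message:
    // resultDF and the path are hypothetical; file:// forces the local filesystem
    resultDF
      .coalesce(1)                     // optional: collapse to a single output file for small results
      .write
      .mode("overwrite")
      .option("header", "true")
      .csv("file:///C:/tmp/output")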

Fastest way to check if a DataFrame (Scala) is empty?

不羁岁月 submitted on 2019-12-11 02:47:50
Question: What is the fastest way to check whether a DataFrame (Scala) is empty? I use DF.limit(1).rdd.isEmpty, which is faster than DF.rdd.isEmpty but still not ideal. Is there a better way to do that?
Answer 1: I usually wrap a call to first in a Try:
    import scala.util.Try
    val t = Try(df.first)
From there you can match on it, Success or Failure, to control logic:
    import scala.util.{Success, Failure}
    t match {
      case Success(df) => // do stuff with the dataframe
      case Failure(e) => // dataframe is empty; do other stuff
      //e
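A minimal sketch comparing the approaches mentioned above (df is any DataFrame):
    import scala.util.Try

    // Pull at most one row to the driver instead of converting to an RDD
    val isEmptyViaHead = df.head(1).isEmpty

    // The Try-based variant from the answer: first() throws on an empty DataFrame
    val isEmptyViaTry = Try(df.first).isFailure

    // Spark 2.4+ also ships a built-in df.isEmpty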

Getting an empty set while reading data from Kafka in Spark Streaming

我的未来我决定 submitted on 2019-12-11 02:25:55
Question: Hi, I am new to Spark Streaming. I am trying to read an XML file and send it to a Kafka topic. Here is my Kafka code, which sends data to the Kafka console consumer. Code:
    package org.apache.kafka.Kafka_Producer;
    import java.io.BufferedReader;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Properties;
    import java.util.concurrent.ExecutionException;
    import kafka
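As a point of comparison, a minimal Scala sketch of the producer side using the org.apache.kafka.clients.producer API; the broker address, topic name, and input file are placeholder assumptions, not values from the question:
    import java.util.Properties
    import scala.io.Source
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Send each line of the file as one record to the (hypothetical) topic
    for (line <- Source.fromFile("input.xml").getLines())
      producer.send(new ProducerRecord[String, String]("my-topic", line))
    producer.close()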

Extracting values from a Spark column containing nested values [duplicate]

怎甘沉沦 submitted on 2019-12-11 01:57:27
Question: This question already has answers here: Querying Spark SQL DataFrame with complex types (3 answers). Closed last year. This is part of the schema of my MongoDB collection:
    |-- variables: struct (nullable = true)
    |    |-- actives: struct (nullable = true)
    |    |    |-- data: struct (nullable = true)
    |    |    |    |-- 0: struct (nullable = true)
    |    |    |    |    |-- active: integer (nullable = true)
    |    |    |    |    |-- inactive: integer (nullable = true)
I've fetched the collection and stored it in a Spark dataframe and am now
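A minimal sketch of pulling the nested fields out, assuming the dataframe is called df and has the schema above; getField walks the struct one level at a time:
    import org.apache.spark.sql.functions.col

    val active   = col("variables").getField("actives").getField("data").getField("0").getField("active")
    val inactive = col("variables").getField("actives").getField("data").getField("0").getField("inactive")

    df.select(active.as("active"), inactive.as("inactive")).show()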

Spark: cast a column to a SQL type stored in a string

情到浓时终转凉″ submitted on 2019-12-11 01:32:04
Question: The simple request is that I need help adding a column to a dataframe, but the column has to be empty, its type is from ...spark.sql.types, and the type has to be defined from a string. I could probably do this with ifs or a case, but I'm looking for something more elegant, something that does not require writing a case for every type in org.apache.spark.sql.types. If I do this, for example:
    df = df.withColumn("col_name", lit(null).cast(org.apache.spark.sql.types.StringType))
It works as intended, but I
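One way to avoid a per-type match, as a minimal sketch: Column.cast also accepts the type's string name, so the string can be passed straight through (typeName is a hypothetical value read from configuration):
    import org.apache.spark.sql.functions.lit

    val typeName = "string"                                   // hypothetical, e.g. "int", "decimal(10,2)"
    val df2 = df.withColumn("col_name", lit(null).cast(typeName))

    // If an actual DataType object is needed, newer Spark versions can parse it from DDL:
    // val dt = org.apache.spark.sql.types.DataType.fromDDL(typeName)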

Create a column in a PySpark dataframe using a list whose indices are present in one column of the dataframe

半世苍凉 submitted on 2019-12-11 01:22:31
Question: I'm new to Python and PySpark. I have a dataframe in PySpark like the following:
    ## +---+---+------+
    ## | x1| x2| x3   |
    ## +---+---+------+
    ## |  0| a | 13.0 |
    ## |  2| B | -33.0|
    ## |  1| B | -63.0|
    ## +---+---+------+
I have an array: arr = [10, 12, 13]. I want to create a column x4 in the dataframe such that it has the corresponding values from the list, using the values of x1 as indices. The final dataset should look like:
    ## +---+---+------+-----+
    ## | x1| x2| x3   | x4  |
    ## +---+---+------+-
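The question is PySpark, but the idea carries over directly. A minimal Scala sketch (df, and the assumption that x1 is always a valid 0-based index into the list, are hypothetical): build a literal array column from the list and index it with x1:
    import org.apache.spark.sql.functions.{array, expr, lit}

    val arr = Seq(10, 12, 13)
    val withX4 = df
      .withColumn("lookup", array(arr.map(lit): _*))   // literal array column [10, 12, 13]
      .withColumn("x4", expr("lookup[x1]"))            // 0-based array indexing by the x1 column
      .drop("lookup")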

Programmatically generate the schema AND the data for a dataframe in Apache Spark

穿精又带淫゛_ submitted on 2019-12-11 00:19:38
Question: I would like to dynamically generate a dataframe containing a header record for a report, i.e. create a dataframe from the value of the string below:
    val headerDescs : String = "Name,Age,Location"
    val headerSchema = StructType(headerDescs.split(",").map(fieldName => StructField(fieldName, StringType, true)))
However, now I want to do the same for the data (which is in effect the same data, i.e. the metadata). I create an RDD:
    val headerRDD = sc.parallelize(headerDescs.split(","))
I then
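Continuing the excerpt, a minimal sketch of pairing the generated schema with a single data row built from the same string (sc and sqlContext are assumed from the question's context):
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val headerDescs: String = "Name,Age,Location"
    val headerSchema = StructType(headerDescs.split(",").map(f => StructField(f, StringType, true)))

    // One Row whose values are the field names themselves, wrapped in an RDD[Row]
    val headerRDD = sc.parallelize(Seq(Row(headerDescs.split(","): _*)))
    val headerDF  = sqlContext.createDataFrame(headerRDD, headerSchema)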

How to add a JAR using HiveContext in a Spark job

孤街醉人 submitted on 2019-12-10 23:46:41
Question: I am trying to add the JSONSerDe jar file in order to access the JSON data and load it into a Hive table from the Spark job. My code is shown below:
    SparkConf sparkConf = new SparkConf().setAppName("KafkaStreamToHbase");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    JavaStreamingContext jssc = new JavaStreamingContext(sc, Durations.seconds(10));
    final SQLContext sqlContext = new SQLContext(sc);
    final HiveContext hiveContext = new HiveContext(sc);
    hiveContext.sql("ADD JAR hdfs:/
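A minimal Scala sketch of the same idea (the jar path below is a placeholder, not the one elided in the question): register the SerDe jar through the HiveContext session, or ship it with the job:
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

    // Register the SerDe jar with the Hive session used by the job (placeholder path)
    hiveContext.sql("ADD JAR hdfs:///user/libs/json-serde.jar")

    // Alternatives: ship the jar with the application, e.g.
    //   spark-submit --jars /local/path/json-serde.jar ...
    // or sc.addJar("/local/path/json-serde.jar")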

Spark custom Kryo encoder not providing a schema for a UDF

为君一笑 submitted on 2019-12-10 23:39:35
Question: While following along with "How to store custom objects in Dataset?" and trying to register my own Kryo encoder for a data frame, I run into the error Schema for type com.esri.core.geometry.Envelope is not supported. There is a function which parses a String (WKT) into a geometry object, like:
    def mapWKTToEnvelope(wkt: String): Envelope = {
      val envBound = new Envelope()
      val spatialReference = SpatialReference.create(4326)
      // Parse the WKT String into a Geometry Object
      val ogcObj = OGCGeometry
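One common workaround, as a minimal sketch (it assumes the mapWKTToEnvelope helper above and that Envelope exposes getXMin/getYMin/getXMax/getYMax accessors): a Kryo encoder helps typed Dataset operations such as map, but a UDF still needs a return type Catalyst can derive a schema for, so return the envelope's bounds as a case class instead of the Envelope itself:
    import org.apache.spark.sql.functions.udf

    case class EnvelopeBounds(xmin: Double, ymin: Double, xmax: Double, ymax: Double)

    val wktToBounds = udf { wkt: String =>
      val env = mapWKTToEnvelope(wkt)   // helper from the question
      EnvelopeBounds(env.getXMin, env.getYMin, env.getXMax, env.getYMax)
    }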