spark-dataframe

Convert Spark DataFrame with schema to DataFrame of JSON strings

Posted by 三世轮回 on 2019-12-11 17:00:11

Question: I have a DataFrame like this:

    +--+--------+--------+----+-------------+------------------------------+
    |id|name    |lastname|age |timestamp    |creditcards                   |
    +--+--------+--------+----+-------------+------------------------------+
    |1 |michel  |blanc   |35  |1496756626921|[[hr6,3569823], [ee3,1547869]]|
    |2 |peter   |barns   |25  |1496756626551|[[ye8,4569872], [qe5,3485762]]|
    +--+--------+--------+----+-------------+------------------------------+

where the schema of my df is like below:

    root
     |-- id: string
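
A minimal sketch of one common approach (not taken from the truncated post; it assumes Spark 2.x and uses hypothetical sample data): DataFrame.toJSON turns each row, nested fields included, into a JSON string, while to_json(struct(...)) keeps the result as a column of a new DataFrame.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, struct, to_json}

    object DfToJsonSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("df-to-json").getOrCreate()
        import spark.implicits._

        // Hypothetical stand-in for the DataFrame shown in the question.
        val df = Seq(
          ("1", "michel", "blanc", 35, 1496756626921L, Seq(("hr6", "3569823"), ("ee3", "1547869")))
        ).toDF("id", "name", "lastname", "age", "timestamp", "creditcards")

        // Option 1: one JSON document per row, as a Dataset[String].
        val jsonDs = df.toJSON

        // Option 2: a DataFrame with a single string column holding the JSON for every row.
        val jsonDf = df.select(to_json(struct(df.columns.map(col): _*)).alias("value"))

        jsonDs.show(false)
        jsonDf.show(false)
      }
    }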

Developing a Scala Spark app that connects to Azure Cosmos DB

Posted by 筅森魡賤 on 2019-12-11 15:49:23

Question: I'm working on a Scala Spark app that connects to Cosmos DB, and I can't resolve the dependencies within SBT. Whenever I include org.apache.spark it conflicts with azure-cosmosdb-spark, and if I take out org.apache.spark I can't get SparkSession to resolve. My SBT configuration:

    name := "MyApp"
    version := "1.0"
    scalaVersion := "2.11.8"
    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.11" % "2.3.0",
      "org.apache.spark" % "spark-sql_2.11" % "2.3.0",
      "org.apache.spark" %
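
The build file above is cut off; as a hedged sketch of one common fix (not the poster's final configuration; the connector coordinates, versions, and exclusions below are assumptions), mark the Spark artifacts as "provided" and exclude the Spark jars the connector pulls in transitively, so only one Spark version reaches the classpath:

    // build.sbt sketch -- adjust versions to match the cluster.
    name := "MyApp"
    version := "1.0"
    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.11" % "2.3.0" % "provided",
      "org.apache.spark" % "spark-sql_2.11"  % "2.3.0" % "provided",
      ("com.microsoft.azure" % "azure-cosmosdb-spark_2.3.0_2.11" % "1.2.2")
        .exclude("org.apache.spark", "spark-core_2.11")
        .exclude("org.apache.spark", "spark-sql_2.11")
    )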

Write to Postgres from Databricks using Python [duplicate]

Posted by 邮差的信 on 2019-12-11 15:48:00

Question: This question already has answers here: How to use JDBC source to write and read data in (Py)Spark? (3 answers). Closed last year. I have a dataframe in Databricks called customerDetails.

    +--------------------+-----------+
    |        customerName| customerId|
    +--------------------+-----------+
    |John Smith          |       0001|
    |Jane Burns          |       0002|
    |Frank Jones         |       0003|
    +--------------------+-----------+

I would like to be able to copy this from Databricks to a table within Postgres. I found this post which used
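
The linked duplicate describes a plain JDBC write; a sketch of that approach (written in Scala for consistency with the other snippets on this page; host, database, table name, and credentials are placeholders, and the Postgres JDBC driver must be on the classpath):

    import java.util.Properties
    import org.apache.spark.sql.{SaveMode, SparkSession}

    object WriteToPostgresSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("write-postgres").getOrCreate()

        // Assumes the DataFrame is available as a registered table called customerDetails.
        val customerDetails = spark.table("customerDetails")

        val jdbcUrl = "jdbc:postgresql://my-host:5432/my_database" // placeholder connection details
        val props = new Properties()
        props.setProperty("user", "my_user")
        props.setProperty("password", "my_password")
        props.setProperty("driver", "org.postgresql.Driver")

        // Append the rows into a Postgres table named customer_details.
        customerDetails.write
          .mode(SaveMode.Append)
          .jdbc(jdbcUrl, "customer_details", props)
      }
    }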

Spark's DataFrame count() function taking very long

Posted by 余生颓废 on 2019-12-11 15:35:25

Question: In my code, I have a sequence of DataFrames where I want to filter out the DataFrames which are empty. I'm doing something like:

    Seq(df1, df2).map(df => df.count() > 0)

However, this is taking extremely long, consuming around 7 minutes for approximately 2 DataFrames of 100k rows each. My question: why is Spark's implementation of count() slow? Is there a work-around?

Answer 1: Count is a lazy operation. So it does not matter how big your dataframe is. But if you have too many costly
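
The answer above is truncated; a frequently suggested workaround (a sketch, not the accepted answer) is to avoid a full count when only emptiness matters, since fetching a single row is enough:

    import org.apache.spark.sql.DataFrame

    // Checks emptiness by pulling at most one row instead of scanning every partition for a full count.
    def nonEmpty(df: DataFrame): Boolean = df.head(1).nonEmpty

    // Usage mirroring the question: keep only the non-empty DataFrames.
    def keepNonEmpty(dfs: Seq[DataFrame]): Seq[DataFrame] = dfs.filter(nonEmpty)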

Spark - I get a java.lang.UnsupportedOperationException when I invoke a custom function from a map

Posted by 橙三吉。 on 2019-12-11 15:32:40

Question: I have a DataFrame with a structure similar to:

    root
     |-- NPAData: struct (nullable = true)
     |    |-- NPADetails: struct (nullable = true)
     |    |    |-- location: string (nullable = true)
     |    |    |-- manager: string (nullable = true)
     |    |-- service: array (nullable = true)
     |    |    |-- element: struct (containsNull = true)
     |    |    |    |-- serviceName: string (nullable = true)
     |    |    |    |-- serviceCode: string (nullable = true)
     |-- NPAHeader: struct (nullable = true)
     |    |-- npaNumber: string (nullable = true)
     |    |-- date:
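
The failing code is cut off above, so the cause here is an assumption: a frequent source of that exception is calling .map on a DataFrame of Row without an Encoder[Row] in scope. A sketch of the usual workaround in Spark 2.x is to pass an explicit RowEncoder built from the output schema:

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.catalyst.encoders.RowEncoder

    // Sketch: map over the rows while keeping the same schema, supplying the Row encoder explicitly.
    def transformRows(df: DataFrame): DataFrame = {
      implicit val rowEncoder = RowEncoder(df.schema)
      df.map { row =>
        // ... custom per-row logic would go here; the row is returned unchanged for illustration ...
        row
      }
    }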

How to update pyspark dataframe metadata on Spark 2.1?

Posted by 主宰稳场 on 2019-12-11 14:57:40

Question: I'm facing an issue with the OneHotEncoder of Spark ML, since it reads DataFrame metadata in order to determine the value range it should assign for the sparse vector object it's creating. More specifically, I'm encoding an "hour" field using a training set containing all individual values between 0 and 23. Now I'm scoring a single-row data frame using the "transform" method of the Pipeline. Unfortunately, this leads to a differently encoded sparse vector object for the OneHotEncoder: (24,[5],[1.0
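
The encoded vector above is cut off mid-expression. One workaround often suggested for this mismatch (a sketch, assuming Spark 2.1-era behaviour; shown in Scala, while the PySpark equivalent goes through Column.alias metadata) is to attach nominal-attribute metadata to the scoring column so OneHotEncoder sees the full 0-23 range instead of inferring it from the single row:

    import org.apache.spark.ml.attribute.NominalAttribute
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Declare in the column metadata that "hour" has 24 possible values, so the encoder
    // produces the same 24-slot sparse vector at scoring time as it did at training time.
    def withHourMetadata(df: DataFrame): DataFrame = {
      val hourMeta = NominalAttribute.defaultAttr
        .withName("hour")
        .withNumValues(24)
        .toMetadata()
      df.withColumn("hour", col("hour").as("hour", hourMeta))
    }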

Spark-XML: Array within an Array in DataFrame to generate XML

Posted by 走远了吗. on 2019-12-11 14:56:14

Question: I have a requirement to generate an XML with the structure below:

    <parent>
      <name>parent</name>
      <childs>
        <child>
          <name>child1</name>
        </child>
        <child>
          <name>child1</name>
          <grandchilds>
            <grandchild>
              <name>grand1</name>
            </grandchild>
            <grandchild>
              <name>grand2</name>
            </grandchild>
            <grandchild>
              <name>grand3</name>
            </grandchild>
          </grandchilds>
        </child>
        <child>
          <name>child1</name>
        </child>
      </childs>
    </parent>

As you can see, a parent will have child node(s) and a child node may have grandchild node(s). https:
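
One way to get that shape (a sketch, not the poster's solution; it assumes the spark-xml package is on the classpath and that wrapper tags are modelled as structs whose array fields carry the inner tag names) is to let case classes mirror the nesting, so the DataFrame schema itself produces the array-within-array layout when written:

    import org.apache.spark.sql.SparkSession

    case class Grandchild(name: String)
    case class Grandchilds(grandchild: Seq[Grandchild])
    case class Child(name: String, grandchilds: Option[Grandchilds] = None)
    case class Childs(child: Seq[Child])
    case class Parent(name: String, childs: Childs)

    object NestedXmlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("nested-xml").getOrCreate()
        import spark.implicits._

        val parents = Seq(
          Parent("parent", Childs(Seq(
            Child("child1"),
            Child("child1", Some(Grandchilds(Seq(Grandchild("grand1"), Grandchild("grand2"), Grandchild("grand3"))))),
            Child("child1")
          )))
        ).toDS()

        // Each row becomes one <parent> element; struct fields become wrapper tags and
        // array fields become repeated inner elements.
        parents.write
          .format("com.databricks.spark.xml")
          .option("rootTag", "parents")
          .option("rowTag", "parent")
          .save("/tmp/parents-xml") // hypothetical output path
      }
    }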

Reading binary data into a (Py)Spark DataFrame

Posted by 时光总嘲笑我的痴心妄想 on 2019-12-11 13:55:09

Question: I'm ingesting a binary file into Spark. The file structure is simple: it consists of a series of records, and each record holds a number of floats. At the moment, I'm reading in the data in chunks in Python and then iterating through the individual records to turn them into Row objects that Spark can use to construct a DataFrame. This is very inefficient, because instead of processing the data in chunks it requires me to loop through the individual elements. Is there an obvious (preferred) way
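
The question is about PySpark, but the distributed-chunking idea is the same from Scala; a sketch assuming fixed-length records of little-endian 4-byte floats (the record length, endianness, and path are assumptions):

    import java.nio.{ByteBuffer, ByteOrder}
    import org.apache.spark.sql.SparkSession

    object BinaryRecordsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("binary-records").getOrCreate()
        import spark.implicits._

        val floatsPerRecord = 10               // assumption: 10 floats per record
        val recordLength = floatsPerRecord * 4 // bytes per fixed-length record

        // binaryRecords splits the file into fixed-size byte records across the cluster,
        // so there is no driver-side loop over individual records.
        val records = spark.sparkContext.binaryRecords("/tmp/data.bin", recordLength)

        val df = records.map { bytes =>
          val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
          Array.fill(floatsPerRecord)(buf.getFloat)
        }.toDF("values")

        df.show(false)
      }
    }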

How to use SparkSession and StreamingContext together?

Posted by 吃可爱长大的小学妹 on 2019-12-11 12:48:56

Question: I'm trying to stream CSV files from a folder on my local machine (OSX). I'm using SparkSession and StreamingContext together like so:

    val sc: SparkContext = createSparkContext(sparkContextName)
    val sparkSess = SparkSession.builder().config(sc.getConf).getOrCreate()
    val ssc = new StreamingContext(sparkSess.sparkContext, Seconds(time))
    val csvSchema = new StructType().add("field_name", StringType)
    val inputDF = sparkSess.readStream.format("org.apache.spark.csv").schema(csvSchema).csv("file://
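
The snippet above mixes the DStream StreamingContext with the Structured Streaming readStream API; assuming the intent is simply to watch a folder for new CSV files, a sketch using Structured Streaming alone (no StreamingContext needed; the folder path is a placeholder):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StringType, StructType}

    object CsvStreamSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("csv-stream")
          .master("local[*]")
          .getOrCreate()

        val csvSchema = new StructType().add("field_name", StringType)

        // Structured Streaming watches the folder and picks up each new CSV file as it appears.
        val inputDF = spark.readStream
          .schema(csvSchema)
          .csv("file:///tmp/csv-input")

        val query = inputDF.writeStream
          .format("console")
          .outputMode("append")
          .start()

        query.awaitTermination()
      }
    }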

Spark Scala Error while saving DataFrame to Hive

Posted by こ雲淡風輕ζ on 2019-12-11 12:16:34

Question: I have built a DataFrame by combining multiple arrays. When I try to save it into a Hive table, I get an ArrayIndexOutOfBoundsException. Following are the code and the error I got. I tried adding the case class both outside and inside the main def, but I still get the same error.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{Row, SQLContext, DataFrame}
    import org.apache.spark.ml.feature.RFormula
    import java.text._
    import java.util.Date
    import org.apache.hadoop
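
The code and stack trace are cut off above; as a hedged sketch of the general pattern (not the poster's code), define the case class at the top level, zip the arrays so every row carries the same number of fields, and save through saveAsTable:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    // Case class defined at the top level (outside main) so Spark can derive an encoder for it.
    case class Record(name: String, score: Double)

    object SaveToHiveSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("save-to-hive")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        // Hypothetical arrays standing in for the ones combined in the question;
        // zip keeps the rows aligned so no row ends up missing a field.
        val names = Array("a", "b", "c")
        val scores = Array(1.0, 2.0, 3.0)
        val df = names.zip(scores).map { case (n, s) => Record(n, s) }.toSeq.toDF()

        df.write.mode(SaveMode.Overwrite).saveAsTable("my_db.my_table") // hypothetical table name
      }
    }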