spark-dataframe

Convert Spark DataFrame with schema to DataFrame of JSON strings

Posted by 三世轮回 on 2019-12-11 17:00:11

Question: I have a DataFrame like this:

    +--+--------+--------+----+-------------+------------------------------+
    |id|name    |lastname|age |timestamp    |creditcards                   |
    +--+--------+--------+----+-------------+------------------------------+
    |1 |michel  |blanc   |35  |1496756626921|[[hr6,3569823], [ee3,1547869]]|
    |2 |peter   |barns   |25  |1496756626551|[[ye8,4569872], [qe5,3485762]]|
    +--+--------+--------+----+-------------+------------------------------+

where the schema of my df is like below:

    root
     |-- id: string
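
A minimal sketch of one common approach (not taken from the truncated post; it assumes Spark 2.x and uses hypothetical sample data): DataFrame.toJSON turns each row, nested fields included, into a JSON string, while to_json(struct(...)) keeps the result as a column of a new DataFrame.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, struct, to_json}

    object DfToJsonSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("df-to-json").getOrCreate()
        import spark.implicits._

        // Hypothetical stand-in for the DataFrame shown in the question.
        val df = Seq(
          ("1", "michel", "blanc", 35, 1496756626921L, Seq(("hr6", "3569823"), ("ee3", "1547869")))
        ).toDF("id", "name", "lastname", "age", "timestamp", "creditcards")

        // Option 1: one JSON document per row, as a Dataset[String].
        val jsonDs = df.toJSON

        // Option 2: a DataFrame with a single string column holding the JSON for every row.
        val jsonDf = df.select(to_json(struct(df.columns.map(col): _*)).alias("value"))

        jsonDs.show(false)
        jsonDf.show(false)
      }
    }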

Developing a Scala Spark app that connects to Azure Cosmos DB

Posted by 筅森魡賤 on 2019-12-11 15:49:23

Question: I'm working on a Scala Spark app that connects to Cosmos DB, and I can't resolve the dependencies within SBT. Whenever I include org.apache.spark it conflicts with azure-cosmosdb-spark, and if I take out org.apache.spark I can't get SparkSession to resolve. My SBT configuration:

    name := "MyApp"
    version := "1.0"
    scalaVersion := "2.11.8"
    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.11" % "2.3.0",
      "org.apache.spark" % "spark-sql_2.11" % "2.3.0",
      "org.apache.spark" %
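
The build file above is cut off; as a hedged sketch of one common fix (not the poster's final configuration; the connector coordinates, versions, and exclusions below are assumptions), mark the Spark artifacts as "provided" and exclude the Spark jars the connector pulls in transitively, so only one Spark version reaches the classpath:

    // build.sbt sketch -- adjust versions to match the cluster.
    name := "MyApp"
    version := "1.0"
    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.11" % "2.3.0" % "provided",
      "org.apache.spark" % "spark-sql_2.11"  % "2.3.0" % "provided",
      ("com.microsoft.azure" % "azure-cosmosdb-spark_2.3.0_2.11" % "1.2.2")
        .exclude("org.apache.spark", "spark-core_2.11")
        .exclude("org.apache.spark", "spark-sql_2.11")
    )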

Write to Postgres from Databricks using Python [duplicate]

Posted by 邮差的信 on 2019-12-11 15:48:00

Question: This question already has answers here: How to use JDBC source to write and read data in (Py)Spark? (3 answers). Closed last year. I have a dataframe in Databricks called customerDetails.

    +--------------------+-----------+
    |        customerName| customerId|
    +--------------------+-----------+
    |John Smith          |       0001|
    |Jane Burns          |       0002|
    |Frank Jones         |       0003|
    +--------------------+-----------+

I would like to be able to copy this from Databricks to a table within Postgres. I found this post which used
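
The linked duplicate describes a plain JDBC write; a sketch of that approach (written in Scala for consistency with the other snippets on this page; host, database, table name, and credentials are placeholders, and the Postgres JDBC driver must be on the classpath):

    import java.util.Properties
    import org.apache.spark.sql.{SaveMode, SparkSession}

    object WriteToPostgresSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("write-postgres").getOrCreate()

        // Assumes the DataFrame is available as a registered table called customerDetails.
        val customerDetails = spark.table("customerDetails")

        val jdbcUrl = "jdbc:postgresql://my-host:5432/my_database" // placeholder connection details
        val props = new Properties()
        props.setProperty("user", "my_user")
        props.setProperty("password", "my_password")
        props.setProperty("driver", "org.postgresql.Driver")

        // Append the rows into a Postgres table named customer_details.
        customerDetails.write
          .mode(SaveMode.Append)
          .jdbc(jdbcUrl, "customer_details", props)
      }
    }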

Spark's DataFrame count() function taking very long

Posted by 余生颓废 on 2019-12-11 15:35:25

Question: In my code, I have a sequence of DataFrames where I want to filter out the DataFrames which are empty. I'm doing something like:

    Seq(df1, df2).map(df => df.count() > 0)

However, this is taking extremely long, consuming around 7 minutes for approximately 2 DataFrames of 100k rows each. My question: why is Spark's implementation of count() slow? Is there a work-around?

Answer 1: Count is a lazy operation. So it does not matter how big your dataframe is. But if you have too many costly
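
The answer above is truncated; a frequently suggested workaround (a sketch, not the accepted answer) is to avoid a full count when only emptiness matters, since fetching a single row is enough:

    import org.apache.spark.sql.DataFrame

    // Checks emptiness by pulling at most one row instead of scanning every partition for a full count.
    def nonEmpty(df: DataFrame): Boolean = df.head(1).nonEmpty

    // Usage mirroring the question: keep only the non-empty DataFrames.
    def keepNonEmpty(dfs: Seq[DataFrame]): Seq[DataFrame] = dfs.filter(nonEmpty)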

Spark - I get a java.lang.UnsupportedOperationException when I invoke a custom function from a map

Posted by 橙三吉。 on 2019-12-11 15:32:40

Question: I have a DataFrame with a structure similar to:

    root
     |-- NPAData: struct (nullable = true)
     |    |-- NPADetails: struct (nullable = true)
     |    |    |-- location: string (nullable = true)
     |    |    |-- manager: string (nullable = true)
     |    |-- service: array (nullable = true)
     |    |    |-- element: struct (containsNull = true)
     |    |    |    |-- serviceName: string (nullable = true)
     |    |    |    |-- serviceCode: string (nullable = true)
     |-- NPAHeader: struct (nullable = true)
     |    |-- npaNumber: string (nullable = true)
     |    |-- date:
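
The failing code is cut off above, so the cause here is an assumption: a frequent source of that exception is calling .map on a DataFrame of Row without an Encoder[Row] in scope. A sketch of the usual workaround in Spark 2.x is to pass an explicit RowEncoder built from the output schema:

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.catalyst.encoders.RowEncoder

    // Sketch: map over the rows while keeping the same schema, supplying the Row encoder explicitly.
    def transformRows(df: DataFrame): DataFrame = {
      implicit val rowEncoder = RowEncoder(df.schema)
      df.map { row =>
        // ... custom per-row logic would go here; the row is returned unchanged for illustration ...
        row
      }
    }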

How to update pyspark dataframe metadata on Spark 2.1?

Posted by 主宰稳场 on 2019-12-11 14:57:40

Question: I'm facing an issue with the OneHotEncoder of Spark ML, since it reads DataFrame metadata in order to determine the value range it should assign for the sparse vector object it's creating. More specifically, I'm encoding an "hour" field using a training set containing all individual values between 0 and 23. Now I'm scoring a single-row data frame using the "transform" method of the Pipeline. Unfortunately, this leads to a differently encoded sparse vector object for the OneHotEncoder: (24,[5],[1.0
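
The encoded vector above is cut off mid-expression. One workaround often suggested for this mismatch (a sketch, assuming Spark 2.1-era behaviour; shown in Scala, while the PySpark equivalent goes through Column.alias metadata) is to attach nominal-attribute metadata to the scoring column so OneHotEncoder sees the full 0-23 range instead of inferring it from the single row:

    import org.apache.spark.ml.attribute.NominalAttribute
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Declare in the column metadata that "hour" has 24 possible values, so the encoder
    // produces the same 24-slot sparse vector at scoring time as it did at training time.
    def withHourMetadata(df: DataFrame): DataFrame = {
      val hourMeta = NominalAttribute.defaultAttr
        .withName("hour")
        .withNumValues(24)
        .toMetadata()
      df.withColumn("hour", col("hour").as("hour", hourMeta))
    }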

Spark-XML: Array within an Array in DataFrame to generate XML

Posted by 走远了吗. on 2019-12-11 14:56:14

Question: I have a requirement to generate an XML with the structure below:

    <parent>
      <name>parent</name>
      <childs>
        <child>
          <name>child1</name>
        </child>
        <child>
          <name>child1</name>
          <grandchilds>
            <grandchild>
              <name>grand1</name>
            </grandchild>
            <grandchild>
              <name>grand2</name>
            </grandchild>
            <grandchild>
              <name>grand3</name>
            </grandchild>
          </grandchilds>
        </child>
        <child>
          <name>child1</name>
        </child>
      </childs>
    </parent>

As you can see, a parent will have child node(s) and a child node may have grandchild node(s). https:
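
One way to get that shape (a sketch, not the poster's solution; it assumes the spark-xml package is on the classpath and that wrapper tags are modelled as structs whose array fields carry the inner tag names) is to let case classes mirror the nesting, so the DataFrame schema itself produces the array-within-array layout when written:

    import org.apache.spark.sql.SparkSession

    case class Grandchild(name: String)
    case class Grandchilds(grandchild: Seq[Grandchild])
    case class Child(name: String, grandchilds: Option[Grandchilds] = None)
    case class Childs(child: Seq[Child])
    case class Parent(name: String, childs: Childs)

    object NestedXmlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("nested-xml").getOrCreate()
        import spark.implicits._

        val parents = Seq(
          Parent("parent", Childs(Seq(
            Child("child1"),
            Child("child1", Some(Grandchilds(Seq(Grandchild("grand1"), Grandchild("grand2"), Grandchild("grand3"))))),
            Child("child1")
          )))
        ).toDS()

        // Each row becomes one <parent> element; struct fields become wrapper tags and
        // array fields become repeated inner elements.
        parents.write
          .format("com.databricks.spark.xml")
          .option("rootTag", "parents")
          .option("rowTag", "parent")
          .save("/tmp/parents-xml") // hypothetical output path
      }
    }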

Reading binary data into a (Py)Spark DataFrame

Posted by 时光总嘲笑我的痴心妄想 on 2019-12-11 13:55:09

Question: I'm ingesting a binary file into Spark. The file structure is simple: it consists of a series of records, and each record holds a number of floats. At the moment, I'm reading in the data in chunks in Python and then iterating through the individual records to turn them into Row objects that Spark can use to construct a DataFrame. This is very inefficient, because instead of processing the data in chunks it requires me to loop through the individual elements. Is there an obvious (preferred) way
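
The question is about PySpark, but the distributed-chunking idea is the same from Scala; a sketch assuming fixed-length records of little-endian 4-byte floats (the record length, endianness, and path are assumptions):

    import java.nio.{ByteBuffer, ByteOrder}
    import org.apache.spark.sql.SparkSession

    object BinaryRecordsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("binary-records").getOrCreate()
        import spark.implicits._

        val floatsPerRecord = 10               // assumption: 10 floats per record
        val recordLength = floatsPerRecord * 4 // bytes per fixed-length record

        // binaryRecords splits the file into fixed-size byte records across the cluster,
        // so there is no driver-side loop over individual records.
        val records = spark.sparkContext.binaryRecords("/tmp/data.bin", recordLength)

        val df = records.map { bytes =>
          val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
          Array.fill(floatsPerRecord)(buf.getFloat)
        }.toDF("values")

        df.show(false)
      }
    }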

How to use SparkSession and StreamingContext together?

Posted by 吃可爱长大的小学妹 on 2019-12-11 12:48:56

Question: I'm trying to stream CSV files from a folder on my local machine (OSX). I'm using SparkSession and StreamingContext together like so:

    val sc: SparkContext = createSparkContext(sparkContextName)
    val sparkSess = SparkSession.builder().config(sc.getConf).getOrCreate()
    val ssc = new StreamingContext(sparkSess.sparkContext, Seconds(time))
    val csvSchema = new StructType().add("field_name", StringType)
    val inputDF = sparkSess.readStream.format("org.apache.spark.csv").schema(csvSchema).csv("file://
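
The snippet above mixes the DStream StreamingContext with the Structured Streaming readStream API; assuming the intent is simply to watch a folder for new CSV files, a sketch using Structured Streaming alone (no StreamingContext needed; the folder path is a placeholder):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StringType, StructType}

    object CsvStreamSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("csv-stream")
          .master("local[*]")
          .getOrCreate()

        val csvSchema = new StructType().add("field_name", StringType)

        // Structured Streaming watches the folder and picks up each new CSV file as it appears.
        val inputDF = spark.readStream
          .schema(csvSchema)
          .csv("file:///tmp/csv-input")

        val query = inputDF.writeStream
          .format("console")
          .outputMode("append")
          .start()

        query.awaitTermination()
      }
    }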

Spark Scala Error while saving DataFrame to Hive

Posted by こ雲淡風輕ζ on 2019-12-11 12:16:34

Question: I have built a DataFrame by combining multiple arrays. When I try to save it into a Hive table, I get an ArrayIndexOutOfBoundsException. Following are the code and the error I got. I tried adding the case class both outside and inside the main def, but I still get the same error.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{Row, SQLContext, DataFrame}
    import org.apache.spark.ml.feature.RFormula
    import java.text._
    import java.util.Date
    import org.apache.hadoop
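
The code and stack trace are cut off above; as a hedged sketch of the general pattern (not the poster's code), define the case class at the top level, zip the arrays so every row carries the same number of fields, and save through saveAsTable:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    // Case class defined at the top level (outside main) so Spark can derive an encoder for it.
    case class Record(name: String, score: Double)

    object SaveToHiveSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("save-to-hive")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        // Hypothetical arrays standing in for the ones combined in the question;
        // zip keeps the rows aligned so no row ends up missing a field.
        val names = Array("a", "b", "c")
        val scores = Array(1.0, 2.0, 3.0)
        val df = names.zip(scores).map { case (n, s) => Record(n, s) }.toSeq.toDF()

        df.write.mode(SaveMode.Overwrite).saveAsTable("my_db.my_table") // hypothetical table name
      }
    }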