spark-dataframe

Convert date to end of month in Spark

久未见 submitted on 2019-12-05 16:59:30
I have a Spark DataFrame as shown below:

    # Create DataFrame
    df <- data.frame(name = c("Thomas", "William", "Bill", "John"),
                     dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-08'))
    df <- createDataFrame(df)

    # Make sure the df$dates column is in 'date' format
    df <- withColumn(df, 'dates', cast(df$dates, 'date'))

    name    | dates
    --------------------
    Thomas  | 2017-01-05
    William | 2017-02-23
    Bill    | 2017-03-16
    John    | 2017-04-08

I want to change each date to the last day of its month, so the result would look like the table below. How do I do this? Either SparkR or PySpark code is fine.

    name    | dates
    --------------------
    Thomas  | 2017-01-31
    William | 2017-02-28
    Bill    | 2017-03-31
    John    | 2017-04-30
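A minimal PySpark sketch of one way to do this (not taken from the original thread) uses the built-in last_day function, which maps a date to the last day of its month; the DataFrame is rebuilt here from the question's sample data, and the SparkSession boilerplate is assumed:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("Thomas", "2017-01-05"), ("William", "2017-02-23"),
         ("Bill", "2017-03-16"), ("John", "2017-04-08")],
        ["name", "dates"])

    # Cast the string column to a date, then replace it with the last day of its month
    df = df.withColumn("dates", F.last_day(F.col("dates").cast("date")))
    df.show()

SparkR exposes an equivalent last_day() column function, so the same one-line withColumn pattern applies there as well.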

Spark 2.0, DataFrame, filter a string column, unequal operator (!==) is deprecated

。_饼干妹妹 submitted on 2019-12-05 14:38:14
Question: I am trying to filter a DataFrame, keeping only the rows whose string column is non-empty. The operation is the following:

    df.filter($"stringColumn" !== "")

My compiler warns that !== has been deprecated since I moved to Spark 2.0.1. How can I check whether a string column value is empty in Spark 2.0+?

Answer 1: Use =!= as a replacement:

    df.filter($"stringColumn" =!= "")

Source: https://stackoverflow.com/questions/40154104/spark-2-0-dataframe-filter-a-string-column-unequal-operator-is-deprecat
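For comparison, a small PySpark sketch of the same non-empty filter (not part of the original answer); in the Python API the plain != operator on a Column is the idiomatic spelling and is not deprecated:

    import pyspark.sql.functions as F

    # Keep only rows where stringColumn is a non-empty string
    non_empty = df.filter(F.col("stringColumn") != "")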

Can I change the nullability of a column in my Spark dataframe?

烈酒焚心 submitted on 2019-12-05 14:28:49
I have a StructField in a dataframe that is not nullable. Simple example:

    import pyspark.sql.functions as F
    from pyspark.sql.types import *

    l = [('Alice', 1)]
    df = sqlContext.createDataFrame(l, ['name', 'age'])
    df = df.withColumn('foo', F.when(df['name'].isNull(), False).otherwise(True))
    df.schema.fields

which returns:

    [StructField(name,StringType,true),
     StructField(age,LongType,true),
     StructField(foo,BooleanType,false)]

Notice that the field foo is not nullable. The problem is that (for reasons I won't go into) I want it to be nullable. I found this post, "Change nullable property of column in spark …"
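One common workaround, sketched here in PySpark rather than quoted from the thread, is to rebuild the DataFrame against a copy of the schema whose nullable flag has been relaxed for the column in question:

    from pyspark.sql.types import StructType, StructField

    # Copy the schema, forcing nullable=True for the 'foo' field only
    relaxed_schema = StructType([
        StructField(f.name, f.dataType, True if f.name == 'foo' else f.nullable)
        for f in df.schema.fields
    ])

    # Recreate the DataFrame with the relaxed schema
    df_nullable = sqlContext.createDataFrame(df.rdd, relaxed_schema)
    df_nullable.schema.fields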

PySpark - Create DataFrame from Numpy Matrix

久未见 submitted on 2019-12-05 13:26:13
I have a numpy matrix:

    arr = np.array([[2,3], [2,8], [2,3], [4,5]])

I need to create a PySpark DataFrame from arr. I cannot type the values in manually because the length/values of arr change dynamically, so I need to convert arr into a dataframe. I tried the following code without success:

    df = sqlContext.createDataFrame(arr, ["A", "B"])

However, I get the following error:

    TypeError: Can not infer schema for type: <type 'numpy.ndarray'>

Hope this helps!

    import numpy as np

    # sample data
    arr = np.array([[2,3], [2,8], [2,3], [4,5]])

    rdd1 = sc.parallelize(arr)
    rdd2 = rdd1.map(lambda x: [int(i) …
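The answer above is cut off, but the idea it is heading toward can be sketched as follows (assuming a SQLContext named sqlContext, as in the question): convert the numpy values to plain Python ints so that schema inference works.

    import numpy as np

    arr = np.array([[2, 3], [2, 8], [2, 3], [4, 5]])

    # ndarray.tolist() converts numpy scalars to native Python ints,
    # which createDataFrame can infer a schema from
    df = sqlContext.createDataFrame(arr.tolist(), ["A", "B"])
    df.show()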

Workaround for importing spark implicits everywhere

ε祈祈猫儿з submitted on 2019-12-05 12:52:21
I'm new to Spark 2.0 and we use Datasets in our code base. I've noticed that I need to import spark.implicits._ everywhere in our code. For example:

File A:

    class A {
      def job(spark: SparkSession) = {
        import spark.implicits._
        // create dataset ds
        val b = new B(spark)
        b.doSomething(ds)
        doSomething(ds)
      }

      private def doSomething(ds: Dataset[Foo], spark: SparkSession) = {
        import spark.implicits._
        ds.map(e => 1)
      }
    }

File B:

    class B(spark: SparkSession) {
      def doSomething(ds: Dataset[Foo]) = {
        import spark.implicits._
        ds.map(e => "SomeString")
      }
    }

What I wanted to ask is whether there's a cleaner way to …

Effect of fetchsize and batchsize on Spark

我们两清 submitted on 2019-12-05 12:22:53
I want to directly control the speed at which Spark reads from and writes to a relational database, yet the parameters named in the title did not seem to have any effect. Can I conclude that fetchsize and batchsize did not work with my testing method? Or do they actually affect read and write performance, and the measurements are simply reasonable given the data scale? Stats of batchsize, fetchsize, and the data set:

    /* Dataset */
    +--------------+-----------+
    | Observations | Dataframe |
    +--------------+-----------+
    |      109,077 | Initial   |
    |      345,732 | Ultimate  |
    +--------------+-----------+

    /* fetchsize */
    +-----------+----- …
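For context, a hedged PySpark sketch of where the two options are set (jdbc_url, user, password, and the table names are placeholders, not taken from the question). fetchsize tunes how many rows the JDBC driver pulls per round trip on read, and batchsize how many rows are grouped per batched INSERT on write; neither changes Spark's parallelism, which is governed by options such as numPartitions, and that is often why their effect is hard to observe.

    # Read: fetchsize controls the JDBC driver's per-round-trip row count
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)            # placeholder connection URL
          .option("dbtable", "source_table")  # placeholder table name
          .option("user", user)
          .option("password", password)
          .option("fetchsize", "10000")
          .load())

    # Write: batchsize controls how many rows go into each batched INSERT
    (df.write.format("jdbc")
       .option("url", jdbc_url)
       .option("dbtable", "target_table")     # placeholder table name
       .option("user", user)
       .option("password", password)
       .option("batchsize", "10000")
       .mode("append")
       .save())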

Extracting a numpy array from a PySpark DataFrame

孤者浪人 submitted on 2019-12-05 11:35:13
Question: I have a dataframe gi_man_df where group can be n:

    +------------------+-----------------+--------+--------------+
    |             group|           number|rand_int|   rand_double|
    +------------------+-----------------+--------+--------------+
    |          'GI_MAN'|                7|       3|         124.2|
    |          'GI_MAN'|                7|      10|        121.15|
    |          'GI_MAN'|                7|      11|         129.0|
    |          'GI_MAN'|                7|      12|         125.0|
    |          'GI_MAN'|                7|      13|         125.0|
    |          'GI_MAN'|                7|      21|         127.0|
    |          'GI_MAN'|                7|      22|         126.0|
    +------------------+-----------------+--------+--------------+

and I am expecting a numpy ndarray …
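The question is cut off here, so the exact target shape is unknown; as one plausible reading, a small sketch that collects the numeric columns to the driver and stacks them into an ndarray:

    import numpy as np

    # Pull the numeric columns back to the driver and build a 2-D array
    rows = gi_man_df.select("rand_int", "rand_double").collect()
    arr = np.array([[r["rand_int"], r["rand_double"]] for r in rows])
    print(arr.shape)   # (7, 2) for the sample data above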

Spark union fails with nested JSON dataframe

杀马特。学长 韩版系。学妹 submitted on 2019-12-05 11:14:53
I have the following two JSON files:

    { "name" : "Agent1", "age" : "32", "details" : [{ "d1" : 1, "d2" : 2 }] }

    { "name" : "Agent2", "age" : "42", "details" : [] }

I read them with Spark:

    val jsonDf1 = spark.read.json(pathToJson1)
    val jsonDf2 = spark.read.json(pathToJson2)

Two dataframes are created with the following schemas:

    root
     |-- age: string (nullable = true)
     |-- details: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- d1: long (nullable = true)
     |    |    |-- d2: long (nullable = true)
     |-- name: string (nullable = true)

    root
     |-- age: string (nullable = true)
     |-- …
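The question is truncated, but the usual cause of this failure is that the empty details array in the second file is inferred with a different element type, so the two schemas do not line up for the union. A hedged sketch of one way around it, written in PySpark although the original is Scala (path_to_json1 and path_to_json2 stand in for the real paths): read both files against one explicit schema.

    from pyspark.sql.types import (StructType, StructField, StringType,
                                   LongType, ArrayType)

    # One explicit schema, so the empty `details` array cannot be inferred differently
    schema = StructType([
        StructField("age", StringType(), True),
        StructField("details", ArrayType(StructType([
            StructField("d1", LongType(), True),
            StructField("d2", LongType(), True),
        ])), True),
        StructField("name", StringType(), True),
    ])

    json_df1 = spark.read.schema(schema).json(path_to_json1)
    json_df2 = spark.read.schema(schema).json(path_to_json2)
    unioned = json_df1.union(json_df2)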

Convert a List into a DataFrame in Spark Scala

有些话、适合烂在心里 submitted on 2019-12-05 10:44:01
I have a list with more than 30 strings. How do I convert the list into a dataframe? What I tried, for example:

    val list = List("a","b","v","b").toDS().toDF()

Output:

    +-------+
    | value |
    +-------+
    | a     |
    | b     |
    | v     |
    | b     |
    +-------+

Expected output:

    +---+---+---+---+
    | _1| _2| _3| _4|
    +---+---+---+---+
    |  a|  b|  v|  a|
    +---+---+---+---+

Any help on this?

List("a","b","c","d") represents a record with one field, so the result set displays one element in each row. To get the expected output, the row should have four fields/elements in it. So we wrap the list as List(("a","b","c","d")), which represents one row, …
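The same one-tuple-equals-one-row idea, sketched in PySpark for comparison (not from the original answer):

    # A single 4-tuple becomes one row with four columns ...
    one_row = spark.createDataFrame([("a", "b", "v", "b")], ["_1", "_2", "_3", "_4"])
    one_row.show()

    # ... whereas a list of one-element tuples becomes one single-column row per element
    four_rows = spark.createDataFrame([("a",), ("b",), ("v",), ("b",)], ["value"])
    four_rows.show()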

How to use orderBy() with descending order in Spark window functions?

混江龙づ霸主 submitted on 2019-12-05 10:05:23
Question: I need a window function that partitions by some keys (= column names), orders by another column name, and returns the rows with the top x ranks. This works fine for ascending order:

    def getTopX(df: DataFrame, top_x: String, top_key: String, top_value: String): DataFrame = {
      val top_keys: List[String] = top_key.split(", ").map(_.trim).toList
      val w = Window.partitionBy(top_keys(1), top_keys.drop(1): _*)
        .orderBy(top_value)
      val rankCondition = "rn < " + top_x.toString
      val dfTop = df.withColumn("rn", row…
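The snippet above is cut off, but the descending case it is building toward can be sketched in PySpark (the column names key_col and value_col and the cutoff top_x are illustrative placeholders): pass a descending Column to orderBy instead of a plain column name.

    from pyspark.sql import Window
    import pyspark.sql.functions as F

    # Order the window by value_col in descending order
    w = Window.partitionBy("key_col").orderBy(F.col("value_col").desc())

    top_x = 3  # keep the top 3 rows per partition
    df_top = (df.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") <= top_x))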