spark-dataframe

Convert date to end of month in Spark

久未见 submitted on 2019-12-05 16:59:30
I have a Spark DataFrame as shown below:

    # Create DataFrame
    df <- data.frame(name = c("Thomas", "William", "Bill", "John"),
                     dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-08'))
    df <- createDataFrame(df)

    # Make sure the df$dates column is in 'date' format
    df <- withColumn(df, 'dates', cast(df$dates, 'date'))

    name    | dates
    --------------------
    Thomas  | 2017-01-05
    William | 2017-02-23
    Bill    | 2017-03-16
    John    | 2017-04-08

I want to change each date to the last day of its month, so the result would look like the table below. How do I do this? Either SparkR or PySpark code is fine.

    name    | dates
    --------------------
    Thomas  | 2017-01-31
    William | 2017-02-28
    Bill    | 2017-03-31
    John    | 2017-04-30
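A minimal PySpark sketch of one way to do this (not taken from the original thread) uses the built-in last_day function, which maps a date to the last day of its month; the DataFrame is rebuilt here from the question's sample data, and the SparkSession boilerplate is assumed:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("Thomas", "2017-01-05"), ("William", "2017-02-23"),
         ("Bill", "2017-03-16"), ("John", "2017-04-08")],
        ["name", "dates"])

    # Cast the string column to a date, then replace it with the last day of its month
    df = df.withColumn("dates", F.last_day(F.col("dates").cast("date")))
    df.show()

SparkR exposes an equivalent last_day() column function, so the same one-line withColumn pattern applies there as well.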

Spark 2.0, DataFrame, filter a string column, unequal operator (!==) is deprecated

。_饼干妹妹 submitted on 2019-12-05 14:38:14
Question: I am trying to filter a DataFrame, keeping only the rows whose string column is non-empty. The operation is the following:

    df.filter($"stringColumn" !== "")

My compiler warns that !== has been deprecated since I moved to Spark 2.0.1. How can I check whether a string column value is empty in Spark 2.0+?

Answer 1: Use =!= as a replacement:

    df.filter($"stringColumn" =!= "")

Source: https://stackoverflow.com/questions/40154104/spark-2-0-dataframe-filter-a-string-column-unequal-operator-is-deprecat
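For comparison, a small PySpark sketch of the same non-empty filter (not part of the original answer); in the Python API the plain != operator on a Column is the idiomatic spelling and is not deprecated:

    import pyspark.sql.functions as F

    # Keep only rows where stringColumn is a non-empty string
    non_empty = df.filter(F.col("stringColumn") != "")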

Can I change the nullability of a column in my Spark dataframe?

烈酒焚心 submitted on 2019-12-05 14:28:49
I have a StructField in a dataframe that is not nullable. Simple example:

    import pyspark.sql.functions as F
    from pyspark.sql.types import *

    l = [('Alice', 1)]
    df = sqlContext.createDataFrame(l, ['name', 'age'])
    df = df.withColumn('foo', F.when(df['name'].isNull(), False).otherwise(True))
    df.schema.fields

which returns:

    [StructField(name,StringType,true),
     StructField(age,LongType,true),
     StructField(foo,BooleanType,false)]

Notice that the field foo is not nullable. The problem is that (for reasons I won't go into) I want it to be nullable. I found this post, "Change nullable property of column in spark …"
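One common workaround, sketched here in PySpark rather than quoted from the thread, is to rebuild the DataFrame against a copy of the schema whose nullable flag has been relaxed for the column in question:

    from pyspark.sql.types import StructType, StructField

    # Copy the schema, forcing nullable=True for the 'foo' field only
    relaxed_schema = StructType([
        StructField(f.name, f.dataType, True if f.name == 'foo' else f.nullable)
        for f in df.schema.fields
    ])

    # Recreate the DataFrame with the relaxed schema
    df_nullable = sqlContext.createDataFrame(df.rdd, relaxed_schema)
    df_nullable.schema.fields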

PySpark - Create DataFrame from Numpy Matrix

久未见 submitted on 2019-12-05 13:26:13
I have a numpy matrix:

    arr = np.array([[2,3], [2,8], [2,3], [4,5]])

I need to create a PySpark DataFrame from arr. I cannot type the values in manually because the length/values of arr change dynamically, so I need to convert arr into a dataframe. I tried the following code without success:

    df = sqlContext.createDataFrame(arr, ["A", "B"])

However, I get the following error:

    TypeError: Can not infer schema for type: <type 'numpy.ndarray'>

Hope this helps!

    import numpy as np

    # sample data
    arr = np.array([[2,3], [2,8], [2,3], [4,5]])

    rdd1 = sc.parallelize(arr)
    rdd2 = rdd1.map(lambda x: [int(i) …
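The answer above is cut off, but the idea it is heading toward can be sketched as follows (assuming a SQLContext named sqlContext, as in the question): convert the numpy values to plain Python ints so that schema inference works.

    import numpy as np

    arr = np.array([[2, 3], [2, 8], [2, 3], [4, 5]])

    # ndarray.tolist() converts numpy scalars to native Python ints,
    # which createDataFrame can infer a schema from
    df = sqlContext.createDataFrame(arr.tolist(), ["A", "B"])
    df.show()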

Workaround for importing spark implicits everywhere

ε祈祈猫儿з submitted on 2019-12-05 12:52:21
I'm new to Spark 2.0 and we use Datasets in our code base. I've noticed that I need to import spark.implicits._ everywhere in our code. For example:

File A:

    class A {
      def job(spark: SparkSession) = {
        import spark.implicits._
        // create dataset ds
        val b = new B(spark)
        b.doSomething(ds)
        doSomething(ds)
      }

      private def doSomething(ds: Dataset[Foo], spark: SparkSession) = {
        import spark.implicits._
        ds.map(e => 1)
      }
    }

File B:

    class B(spark: SparkSession) {
      def doSomething(ds: Dataset[Foo]) = {
        import spark.implicits._
        ds.map(e => "SomeString")
      }
    }

What I wanted to ask is whether there's a cleaner way to …

Effect of fetchsize and batchsize on Spark

我们两清 submitted on 2019-12-05 12:22:53
I want to directly control the speed at which Spark reads from and writes to a relational database, yet the parameters named in the title did not seem to have any effect. Can I conclude that fetchsize and batchsize did not work with my testing method? Or do they actually affect read and write performance, and the measurements are simply reasonable given the data scale? Stats of batchsize, fetchsize, and the data set:

    /* Dataset */
    +--------------+-----------+
    | Observations | Dataframe |
    +--------------+-----------+
    |      109,077 | Initial   |
    |      345,732 | Ultimate  |
    +--------------+-----------+

    /* fetchsize */
    +-----------+----- …
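For context, a hedged PySpark sketch of where the two options are set (jdbc_url, user, password, and the table names are placeholders, not taken from the question). fetchsize tunes how many rows the JDBC driver pulls per round trip on read, and batchsize how many rows are grouped per batched INSERT on write; neither changes Spark's parallelism, which is governed by options such as numPartitions, and that is often why their effect is hard to observe.

    # Read: fetchsize controls the JDBC driver's per-round-trip row count
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)            # placeholder connection URL
          .option("dbtable", "source_table")  # placeholder table name
          .option("user", user)
          .option("password", password)
          .option("fetchsize", "10000")
          .load())

    # Write: batchsize controls how many rows go into each batched INSERT
    (df.write.format("jdbc")
       .option("url", jdbc_url)
       .option("dbtable", "target_table")     # placeholder table name
       .option("user", user)
       .option("password", password)
       .option("batchsize", "10000")
       .mode("append")
       .save())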

Extracting a numpy array from a PySpark DataFrame

孤者浪人 submitted on 2019-12-05 11:35:13
Question: I have a dataframe gi_man_df where group can be n:

    +------------------+-----------------+--------+--------------+
    |             group|           number|rand_int|   rand_double|
    +------------------+-----------------+--------+--------------+
    |          'GI_MAN'|                7|       3|         124.2|
    |          'GI_MAN'|                7|      10|        121.15|
    |          'GI_MAN'|                7|      11|         129.0|
    |          'GI_MAN'|                7|      12|         125.0|
    |          'GI_MAN'|                7|      13|         125.0|
    |          'GI_MAN'|                7|      21|         127.0|
    |          'GI_MAN'|                7|      22|         126.0|
    +------------------+-----------------+--------+--------------+

and I am expecting a numpy ndarray …
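The question is cut off here, so the exact target shape is unknown; as one plausible reading, a small sketch that collects the numeric columns to the driver and stacks them into an ndarray:

    import numpy as np

    # Pull the numeric columns back to the driver and build a 2-D array
    rows = gi_man_df.select("rand_int", "rand_double").collect()
    arr = np.array([[r["rand_int"], r["rand_double"]] for r in rows])
    print(arr.shape)   # (7, 2) for the sample data above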

Spark union fails with nested JSON dataframe

杀马特。学长 韩版系。学妹 submitted on 2019-12-05 11:14:53
I have the following two JSON files:

    { "name" : "Agent1", "age" : "32", "details" : [{ "d1" : 1, "d2" : 2 }] }

    { "name" : "Agent2", "age" : "42", "details" : [] }

I read them with Spark:

    val jsonDf1 = spark.read.json(pathToJson1)
    val jsonDf2 = spark.read.json(pathToJson2)

Two dataframes are created with the following schemas:

    root
     |-- age: string (nullable = true)
     |-- details: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- d1: long (nullable = true)
     |    |    |-- d2: long (nullable = true)
     |-- name: string (nullable = true)

    root
     |-- age: string (nullable = true)
     |-- …
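The question is truncated, but the usual cause of this failure is that the empty details array in the second file is inferred with a different element type, so the two schemas do not line up for the union. A hedged sketch of one way around it, written in PySpark although the original is Scala (path_to_json1 and path_to_json2 stand in for the real paths): read both files against one explicit schema.

    from pyspark.sql.types import (StructType, StructField, StringType,
                                   LongType, ArrayType)

    # One explicit schema, so the empty `details` array cannot be inferred differently
    schema = StructType([
        StructField("age", StringType(), True),
        StructField("details", ArrayType(StructType([
            StructField("d1", LongType(), True),
            StructField("d2", LongType(), True),
        ])), True),
        StructField("name", StringType(), True),
    ])

    json_df1 = spark.read.schema(schema).json(path_to_json1)
    json_df2 = spark.read.schema(schema).json(path_to_json2)
    unioned = json_df1.union(json_df2)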

Convert a List into a DataFrame in Spark Scala

有些话、适合烂在心里 submitted on 2019-12-05 10:44:01
I have a list with more than 30 strings. How do I convert the list into a dataframe? What I tried, for example:

    val list = List("a","b","v","b").toDS().toDF()

Output:

    +-------+
    | value |
    +-------+
    | a     |
    | b     |
    | v     |
    | b     |
    +-------+

Expected output:

    +---+---+---+---+
    | _1| _2| _3| _4|
    +---+---+---+---+
    |  a|  b|  v|  a|
    +---+---+---+---+

Any help on this?

List("a","b","c","d") represents a record with one field, so the result set displays one element in each row. To get the expected output, the row should have four fields/elements in it. So we wrap the list as List(("a","b","c","d")), which represents one row, …
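The same one-tuple-equals-one-row idea, sketched in PySpark for comparison (not from the original answer):

    # A single 4-tuple becomes one row with four columns ...
    one_row = spark.createDataFrame([("a", "b", "v", "b")], ["_1", "_2", "_3", "_4"])
    one_row.show()

    # ... whereas a list of one-element tuples becomes one single-column row per element
    four_rows = spark.createDataFrame([("a",), ("b",), ("v",), ("b",)], ["value"])
    four_rows.show()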

How to use orderBy() with descending order in Spark window functions?

混江龙づ霸主 submitted on 2019-12-05 10:05:23
Question: I need a window function that partitions by some keys (= column names), orders by another column name, and returns the rows with the top x ranks. This works fine for ascending order:

    def getTopX(df: DataFrame, top_x: String, top_key: String, top_value: String): DataFrame = {
      val top_keys: List[String] = top_key.split(", ").map(_.trim).toList
      val w = Window.partitionBy(top_keys(1), top_keys.drop(1): _*)
        .orderBy(top_value)
      val rankCondition = "rn < " + top_x.toString
      val dfTop = df.withColumn("rn", row…
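The snippet above is cut off, but the descending case it is building toward can be sketched in PySpark (the column names key_col and value_col and the cutoff top_x are illustrative placeholders): pass a descending Column to orderBy instead of a plain column name.

    from pyspark.sql import Window
    import pyspark.sql.functions as F

    # Order the window by value_col in descending order
    w = Window.partitionBy("key_col").orderBy(F.col("value_col").desc())

    top_x = 3  # keep the top 3 rows per partition
    df_top = (df.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") <= top_x))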