Extract column values of Dataframe as List in Apache Spark


I want to convert a string column of a data frame to a list. What I can find from the Dataframe API is RDD, so I tried converting it back to RDD first, and then

10 answers

刺人心

2020-12-22 17:29
```
from pyspark.sql.functions import col

df.select(col("column_name")).collect()
```
Here collect() is an action that returns the selected rows to the driver as a list of Row objects. Beware of using it on a huge dataset: every row is pulled into the driver's memory, which hurts performance. It is a good idea to check the size of the data first.
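As a rough sketch of the point above: collect() gives back Row objects, so a list comprehension is one way to unwrap them into plain values, and limit() is one way to cap how much data reaches the driver. The helper name `column_as_list` and its `max_rows` parameter are made up for this illustration; only select(), limit() and collect() come from the DataFrame API.

```python
def column_as_list(df, column_name, max_rows=None):
    """Return one column of a Spark DataFrame as a plain Python list.

    `column_as_list` and `max_rows` are hypothetical names for this sketch.
    """
    selected = df.select(column_name)
    if max_rows is not None:
        # Cap the number of rows pulled back to the driver.
        selected = selected.limit(max_rows)
    # collect() returns a list of Row objects; each Row here has one field.
    return [row[0] for row in selected.collect()]


# Usage, assuming `df` is an existing DataFrame with a "column_name" column:
# values = column_as_list(df, "column_name", max_rows=1000)
```

Leaving `max_rows` as None collects everything, so the cap only helps when it is actually set.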
遥遥无期

2020-12-22 17:37
I know the question asked and the answer given both assume Scala, so I am just providing a small snippet of Python code in case a PySpark user is curious. The syntax is similar to the given answer, but to properly pop the list out I have to reference the column name a second time in the mapping function, and I do not need the select statement.

For example, take a DataFrame containing a column named "Raw".

To get the values of "Raw" combined into a list, where each entry is one row's value, I simply use:
```
MyDataFrame.rdd.map(lambda x: x.Raw).collect()
```
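A possible variant, offered only as a sketch: if the column is selected first, each Row holds a single field, and since a Row behaves like a tuple, flatMap can unpack it without naming the column again inside the lambda. The helper name `column_via_rdd` is hypothetical; select(), rdd, flatMap() and collect() are the PySpark calls assumed here.

```python
def column_via_rdd(df, column_name):
    """Return one column's values as a list by flattening the underlying RDD.

    `column_via_rdd` is a hypothetical helper name for this sketch.
    """
    # After select(), every Row holds exactly one field. A Row is a tuple,
    # so flatMap unpacks each Row into its single value.
    return df.select(column_name).rdd.flatMap(lambda row: row).collect()


# Usage with the DataFrame from the answer above:
# raw_values = column_via_rdd(MyDataFrame, "Raw")
```

Both forms still end in collect(), so the same caution about large datasets applies.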

情歌与酒

2020-12-22 17:38

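```
import java.util.List;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;

// Map every Row to the string value of the chosen column, then collect to the driver.
List<String> whatever_list = df.toJavaRDD().map(new Function<Row, String>() {
    @Override
    public String call(Row row) {
        return row.getAs("column_name").toString();
    }
}).collect();

logger.info(String.format("list is %s", whatever_list)); // verification
```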

Since no one has given a solution in Java (a real programming language), here is one. You can thank me later.


春和景丽

2020-12-22 17:39
With Spark 2.x and Scala 2.11

I can think of 3 possible ways to convert the values of a specific column to a List.

Common code snippets for all the approaches
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate    
import spark.implicits._ // for .toDF() method

val df = Seq(
    ("first", 2.0),
    ("test", 1.5), 
    ("choose", 8.0)
  ).toDF("id", "val")
```
Approach 1
```
df.select("id").collect().map(_(0)).toList
// res9: List[Any] = List(first, test, choose)
```
What happens now? We are collecting data to the driver with collect() and picking element zero from each record.

This is not the best way of doing it. Let's improve it with the next approach.

Approach 2
```
df.select("id").rdd.map(r => r(0)).collect.toList 
// res10: List[Any] = List(first, test, choose)
```
How is it better? We have distributed the map transformation load among the workers rather than leaving it on the single driver.

I know rdd.map(r => r(0)) does not seem elegant to you. So, let's address it in the next approach.

Approach 3
```
df.select("id").map(r => r.getString(0)).collect.toList 
// res11: List[String] = List(first, test, choose)
```
Here we are not converting the DataFrame to an RDD. Look at map: it won't accept r => r(0) (or _(0)) as in the previous approach, due to encoder issues in DataFrame. So we end up using r => r.getString(0), and this would be addressed in later versions of Spark.

Conclusion

All the options give the same output, but 2 and 3 are effective, and in the end the 3rd one is both effective and elegant (I'd think).
