spark-dataframe

PySpark DataFrame: get unique elements from a column whose strings are lists of elements

Submitted by 我的未来我决定 on 2021-02-19 07:34:05
Question: I have a dataframe (created by loading from multiple blobs in Azure) with a column that holds a list of IDs. I want a list of the unique IDs across this entire column. Here is an example, df:

    | col1 | col2 | col3  |
    | "a"  | "b"  |"[q,r]"|
    | "c"  | "f"  |"[s,r]"|

Here is my expected response: resp = [q, r, s]. Any idea how to get there? My current approach is to convert the strings in col3 to Python lists and then maybe flatten them out somehow, but so far I have not been able to do so.
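A possible PySpark approach (a sketch, assuming col3 holds plain strings such as "[q,r]"): strip the brackets, split on commas, explode into one ID per row, and collect the distinct values.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("unique-ids").getOrCreate()
    df = spark.createDataFrame(
        [("a", "b", "[q,r]"), ("c", "f", "[s,r]")],
        ["col1", "col2", "col3"],
    )

    # Remove the surrounding brackets, split into an array of IDs,
    # explode to one row per ID, and keep the distinct values.
    ids = df.select(
        F.explode(F.split(F.regexp_replace("col3", r"[\[\]]", ""), ",")).alias("id")
    ).distinct()

    resp = [row["id"] for row in ids.collect()]
    print(resp)  # e.g. ['q', 'r', 's'] (order not guaranteed)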

Spark: Get only columns that have one or more null values

Submitted by 混江龙づ霸主 on 2021-02-19 04:25:47
Question: From a dataframe, I want to get the names of the columns that contain at least one null value. Consider the dataframe below:

    val dataset = sparkSession.createDataFrame(Seq(
      (7, null, 18, 1.0),
      (8, "CA", null, 0.0),
      (9, "NZ", 15, 0.0)
    )).toDF("id", "country", "hour", "clicked")

    id  country  hour  clicked
    7   null     18    1
    8   "CA"     null  0
    9   "NZ"     15    0

I want to get the column names 'country' and 'hour'.

Answer 1: This is one solution, but it's a bit awkward; I hope there is an easier way:

    val cols = dataset
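A common pattern for this (shown here as a PySpark sketch; the question uses Scala, but the idea translates directly): count the nulls per column in a single aggregation and keep the columns whose count is greater than zero.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("null-columns").getOrCreate()
    df = spark.createDataFrame(
        [(7, None, 18, 1.0), (8, "CA", None, 0.0), (9, "NZ", 15, 0.0)],
        ["id", "country", "hour", "clicked"],
    )

    # Single aggregation pass: count the null values in every column.
    null_counts = df.select(
        [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
    ).collect()[0].asDict()

    cols_with_nulls = [name for name, n in null_counts.items() if n > 0]
    print(cols_with_nulls)  # ['country', 'hour']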

How to divide or multiply every non-string column of a PySpark dataframe by a float constant?

Submitted by 老子叫甜甜 on 2021-02-16 08:43:54
Question: My input dataframe looks like the one below:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Basics").getOrCreate()
    df = spark.createDataFrame(
        data=[('Alice', 4.300, None), ('Bob', float('nan'), 897)],
        schema=['name', 'High', 'Low']
    )

    +-----+----+----+
    | name|High| Low|
    +-----+----+----+
    |Alice| 4.3|null|
    |  Bob| NaN| 897|
    +-----+----+----+

Expected output if divided by 10.0:

    +-----+----+----+
    | name|High| Low|
    +-----+----+----+
    |Alice|0.43|null|
    |  Bob| NaN|89.7|
    +-----+----+----+
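One way to do it (a sketch of a possible approach, not the asker's accepted answer): select the string columns unchanged and divide every other column by the constant; null and NaN propagate through the division on their own.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("Basics").getOrCreate()
    df = spark.createDataFrame(
        data=[('Alice', 4.300, None), ('Bob', float('nan'), 897)],
        schema=['name', 'High', 'Low']
    )

    constant = 10.0
    # Keep string columns as-is, divide everything else by the constant.
    df_divided = df.select([
        F.col(f.name) if isinstance(f.dataType, StringType)
        else (F.col(f.name) / constant).alias(f.name)
        for f in df.schema.fields
    ])
    df_divided.show()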

Spark - Scope, Data Frame, and memory management

Submitted by 空扰寡人 on 2021-02-11 12:41:43
Question: I am curious about how scope works with DataFrames in Spark. In the example below, I have a list of files; each is independently loaded into a DataFrame, some operation is performed, and then dfOutput is written to disk.

    val files = getListOfFiles("outputs/emailsSplit")

    for (file <- files){
      val df = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("delimiter", "\t")          // Delimiter is tab
        .option("parserLib", "UNIVOCITY")   // Parser, which deals better with the email formatting
        .schema
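For illustration, a PySpark sketch of the same loop shape (the question uses Scala and the Databricks CSV reader; the file paths and the "some operation" below are placeholders). A DataFrame created inside the loop is just a local reference to a query plan, so it becomes unreachable once the iteration ends; only data that was explicitly cached needs an unpersist() to release executor memory.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("scope-demo").getOrCreate()
    files = ["outputs/emailsSplit/part-0.tsv", "outputs/emailsSplit/part-1.tsv"]  # placeholder paths

    for path in files:
        df = spark.read.option("delimiter", "\t").option("header", "true").csv(path)
        df_output = df.dropDuplicates()            # stand-in for "some operation"
        df_output.write.mode("overwrite").csv(path + ".out")
        # If df were cached, an explicit df.unpersist() here would free the
        # executor memory; otherwise the reference simply goes out of scope.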

Not able to set the number of shuffle partitions in PySpark

Submitted by ℡╲_俬逩灬. on 2021-02-10 19:57:54
Question: I know that, by default, the number of partitions for shuffle tasks is set to 200 in Spark, and I can't seem to change this. I'm running Jupyter with Spark 1.6 and loading a fairly small table with about 37K rows from Hive, using the following in my notebook:

    from pyspark.sql.functions import *

    sqlContext.sql("set spark.sql.shuffle.partitions=10")
    test = sqlContext.table('some_table')
    print test.rdd.getNumPartitions()
    print test.count()

The output confirms 200 tasks. From the activity log, it's spinning up
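A minimal sketch of setting the property, shown against the newer SparkSession API (on Spark 1.6 the analogous call is sqlContext.setConf). Note that the setting only affects stages after a shuffle; the initial table scan keeps the partitioning of the underlying files, so getNumPartitions() on a freshly read table will not reflect it.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shuffle-partitions").getOrCreate()
    spark.conf.set("spark.sql.shuffle.partitions", "10")

    df = spark.range(37000)
    # The shuffle introduced by groupBy picks up the new setting.
    shuffled = df.groupBy((df.id % 100).alias("bucket")).count()
    print(shuffled.rdd.getNumPartitions())  # typically 10 (adaptive execution may coalesce further)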

Can DataFrame joins in Spark preserve order?

Submitted by 余生颓废 on 2021-02-08 13:50:43
Question: I'm currently trying to join two DataFrames together while retaining the order of one of them. From "Which operations preserve RDD order?", it seems (correct me if this is inaccurate, as I'm new to Spark) that joins do not preserve order, because rows are joined and "arrive" at the final dataframe in no specified order, the data being spread across different partitions. How can one perform a join of two DataFrames while preserving the order of one table? E.g., +------------+---------
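Joins do shuffle data, so row order is not guaranteed. A common workaround (a sketch of one possible approach, not an authoritative answer): tag the table whose order matters with an explicit ordering column, join, then sort on that column and drop it.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("ordered-join").getOrCreate()
    left = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["key", "v1"])
    right = spark.createDataFrame([("c", 30), ("a", 10), ("b", 20)], ["key", "v2"])

    # Record each row's original position, join, then restore the order.
    left_indexed = left.withColumn("_order", F.monotonically_increasing_id())
    joined = (left_indexed.join(right, on="key", how="left")
                          .orderBy("_order")
                          .drop("_order"))
    joined.show()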

Add columns to dataframes dynamically with column names taken from elements of a List

Submitted by 青春壹個敷衍的年華 on 2021-02-08 08:06:42
Question: I have a List of N elements like the one below, where N can be any number of elements:

    val check = List("a", "b", "c", "d")

I have a dataframe with a single column called "value". Based on the contents of value, I need to create N columns whose names are the elements of the list and whose contents are substring(x, y), where x and y are numbers derived from some metadata. I have tried every approach I could think of, such as withColumn and selectExpr, but nothing works. Below are the different pieces of code I tried,
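A sketch of one way to do this (shown in PySpark; the question is in Scala, where a foldLeft over withColumn, or a single select, achieves the same). The (start, length) offsets below are hypothetical stand-ins for the asker's metadata.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dynamic-columns").getOrCreate()
    df = spark.createDataFrame([("abcdefgh",), ("ijklmnop",)], ["value"])

    check = ["a", "b", "c", "d"]
    # Hypothetical (start, length) pairs per new column; in practice these
    # would come from the metadata the asker mentions.
    offsets = {"a": (1, 2), "b": (3, 2), "c": (5, 2), "d": (7, 2)}

    # Build all new columns in one select instead of chaining withColumn calls.
    new_cols = [F.substring("value", *offsets[name]).alias(name) for name in check]
    df.select("value", *new_cols).show()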